October 13, 1997


Dear Colleague: 

Welcome to the Tenth Anniversary Microprocessor 
Forum. It is your participation that makes the event 
such a success. To make the most of your time at
Microprocessor Forum, here are some reminders:

Materials

Can be picked up at the materials tables in the Market Street Foyer (conference) and Regency
Foyer (seminars). Microprocessor Forum Chip Portfolios are handed out at the beginning of 
the Literature and Demonstration Center reception in the Regency Ballroom Tuesday 
evening. Please present the coupons included in your registration envelope to collect your 
materials. Due to non-disclosure agreements, we cannot hand out materials in advance of the
conference or seminars. Proceedings of the conference and copies of the seminar workbooks 
are available for sale at the registration area. 

Name Badges 

Must be worn at all times to gain admittance to Forum events. 

Meals 

We invite you to join us for complimentary continental breakfast and lunch on conference 
days and any seminar day you are registered to attend. These meals will be served in the 
Imperial and Market Street Foyers and in the Regency Ballroom. 

Receptions 

Please plan to join us for the Microprocessor Forum Receptions. Monday night we offer a 
welcome reception in the Market Street Foyer. On Tuesday is the grand Literature & 
Demonstration Center reception where you can pick up your Forum Chip Portfolio and get 
more information about emerging technologies from 35 participating companies. 

Affinity Sessions 

Are scheduled for 8:30 Tuesday evening. Sessions on “IA-64 and the Future of Microprocessor
Architecture,” “The Future of the PC Infrastructure: Socket 7 vs. Slot 1,” “The Future of 3D
Graphics Acceleration,” and “Prospering in New Computing Arenas: Success Beyond Intel’s
Domain” are slated. More details on these sessions are included in your notebook.

Evaluations 

Please take time throughout the conference to fill out the evaluation form found in the inside 
pocket of your attendee notebook. Completed evaluations will be entered in a raffle for a Sony 
DVD player and a Palm Pilot PDA. 

Need Help? 

Just ask anyone wearing a staff ribbon. The MicroDesign Resources team is here to help make 
the conference more productive and enjoyable for you. Thanks for joining us! 


Sincerely 



Founder and Editorial Director, MicroDesign Resources 
Director, Microprocessor Forum 
1997 Microprocessor Forum
Event Locations 




MARKET STREET FOYER

Forum Registration, Badge and Material Pick-up

Continental Breakfast: Monday-Thursday
AM & PM Breaks: Tuesday & Wednesday
Lunch: Monday-Thursday
Reception: Monday
Affinity Sessions

8:30-10:00 pm 
Tuesday, October 14, 1997 


IA-64 and the future of microprocessor architecture 

The emergence of the Intel/HP IA-64 architecture promises to create the biggest shift in the microprocessor landscape 
in many years. The outline of the technology has begun to emerge, but many questions remain. Join Peter Christy, 
president of MDR, to discuss issues including: 

How much of a performance advantage over existing RISCs will the IA-64 64-bit instruction set offer? 

What will Merced look like? 

Will this be the crippling blow for the RISC vendors? 

When will it become significant in the x86 PC market? 


The Future of the PC Infrastructure: Socket 7 vs. Slot 1

The split in strategy between Intel and its x86 competitors is fracturing the PC hardware infrastructure. Join industry
analyst Bert McComas of InQuest in a discussion of questions including: 

Can Intel drive a rapid migration to Slot 1?

What will motivate users to make the switch? 

What will the long-term differences in system performance be? 

How long can Intel’s competitors keep Socket 7 viable? 

What will Intel's competitors do beyond Socket 7? ■ Valiev Room 

How long will Slot 1 live? What’s next? 



The Future of 3D Graphics Acceleration 

No PC technology is advancing as rapidly as 3D acceleration. The rapid pace of change, varied tradeoffs, and differing 
agendas among the industry participants have created uncertainty about which approach will be most successful in the
coming years. Join Peter Glaskowsky, MDR’s 3D and multimedia analyst, to discuss issues including: 

Evolution of 3D benchmarks (with a short presentation from Bill Catchings, ZD Benchmark Operation) 

Is Talisman dead? What is the future for advanced rendering architectures? 

Will geometry processing move off the host processor for mainstream PCs? 

What is the future for OpenGL and other non-DirectX APIs? 


Prospering in New Computing Arenas: Success Beyond Intel's Domain

In the face of Intel’s overwhelming success in PCs, many investors are looking at technologies outside of Intel’s focus as 
the best opportunities for building future businesses. Join venture capitalist Andy Rappaport of August Capital to
discuss questions such as:

What are the prospects for non-PC computing devices, including set-top boxes, Web terminals, organizers,
and game consoles?

How far will Intel’s reach extend into non-PC arenas?

What are the architectural and market requirements for successful non-PC computing devices?

Is this just a place of refuge, or the next big opportunity?
CONFERENCE PROGRAM

October 14-15, 1997
Fairmont Hotel, San Jose 


Conference
Program
Day One

Tuesday,
October 14

8:30 Looking Forward, Looking Back: Welcome to the 10th Annual Microprocessor Forum
Michael Slater, MDR

Keynote Speech: A New World Order - Alternative Microsoft Windows Platforms
Jerry Sanders, AMD

10:05 


x86 Microprocessors 

Moderator: Michael Slater, MDR 

■ Competitive Strategies for x86 Microprocessors
Michael Slater, MDR 

■ Pentium II Design Enhancements 
Robert Colwell, Intel 

■ The AMD-K6 Plus: An Enhanced K6 Microprocessor
Greg Favor, AMD 

Break Hosted by AMD 


■ A New, High-Performance x86 Microprocessor 
Robert Maher, Cyrix 

■ An Enhanced C6-Family Microprocessor 
Glenn Henry, Centaur Technology 

■ Panel: Design Challenges for Next-Generation x86 Microprocessors
All speakers above 

11:45 Lunch Hosted by IBM x86 Microprocessors 


1:00 


3:10 

3:30 


5:35 

6:00 

8:30 


The IA-64 64-Bit Instruction Set Architecture 

Moderator: Linley Gwennap 

■ Motivations and Design Approach for a New 64-Bit Instruction Set Architecture 
John Crawford, Intel and Jerry Huck, Hewlett-Packard 

■ Q&A 

■ Intel Architecture Roadmap 
Fred Pollack, Intel 

■ Panel: Intel Architecture Strategies for High-Performance Computing 

Frank Artale, Microsoft; Steve Chen, Sequent Computer Systems; Les Crudele, Compaq;
Dan Glessner, Hewlett-Packard; Russell Holt, NCR; Trey Smith, IBM

Architecture at HP: Two Decades of Innovation 

Joel Birnbaum, Hewlett-Packard 

Break Hosted by Fujitsu 
High-Performance RISC Microprocessors 

Moderator: Linley Gwennap 

■ The Evolving RISC Landscape 
Linley Gwennap, MDR 

■ Sun’s Next-Generation High-End SPARC Microprocessor
Gary Lauterbach, Sun Microsystems 

■ A Next-Generation 64-Bit PowerPC Microprocessor
Mark Papermaster, IBM 

■ PA-8500: Scaling the PA-8200 With a Large Integrated Cache 
Bill Queen, Hewlett-Packard 

■ A SPARC Microprocessor for High-End Servers 
Hisashige Ando, HAL Computer Systems 

■ Panel: Maximizing RISC Microprocessor Performance 

All speakers above, plus Earl Killian, MIPS; Dan Leibholz, Digital Semiconductor

Microprocessor Report Awards 

Nick Tredennick, Tredennick Inc. 

Reception, Literature and Demo Center Hosted by Team MIPS 
Affinity Sessions 
Conference
Program 
Day Two 

Wednesday, 
October 15 


8:30 Microprocessors for Embedded Applications 

Moderator: Jim Turley, MDR 

■ The Changing Face of Embedded Design 
Jim Turley, MDR 

■ A New, Low-Power Architecture for Embedded Applications 
John Arends, Motorola 

■ Sun’s New microJava Processor
Harlan McGhan, Sun Microsystems 

■ A High-Performance, Multimedia StrongARM Microprocessor 
Robert Stepanian, Digital Semiconductor 

■ An Integrated Pentium-Class Processor for Internet Applications 
Dan O’Neill, National Semiconductor

10:05 Break Hosted by Motorola 

■ A Truly Unified Microcontroller-DSP Architecture 
Rod Fleck, Siemens 

■ ARM9TDMI: The Next-Generation Thumb Processor for Higher Performance Applications
Ian Devereux, ARM Ltd. 

■ The Hyperstone E1-32X Multimedia Processor for Embedded Systems 
Manfred Schlett, Hyperstone 

■ Panel: Optimizing Microprocessor Designs for Emerging Embedded Applications 
All speakers above 

12:00 Lunch Hosted by Rambus, Inc. 


1:15 Next-Generation DRAMs 

Moderator: Peter Song, MDR 

■ Direct Rambus: The New Memory Standard 
Allen Roberts, Rambus 

■ Panel: Alternatives for Next-Generation DRAMs 

David Bondurant, Enhanced Memory Systems; Jack Konrath, Fujitsu; 
Allen Roberts, Rambus; Farhad Tabrizi, SLDRAM Consortium 

2:10 Multimedia Acceleration 

Moderator: Peter N. Glaskowsky, MDR 

■ Alternatives for Multimedia Acceleration 
Peter N. Glaskowsky, MDR 

■ A High-Performance x86 Processor with Integrated 3D Graphics 
Doug Beard, Cyrix 

■ A Mobile 3D Accelerator with Embedded DRAM 
Ronda Collier, S3 

3:00 Break 

■ The E4 MPEG-2 Video Codec 
Les Kohn, C-Cube Microsystems 

■ ManArray Technology: The Scalable Future of Signal Processing 
Gerald Pechanek, BOPS 

■ Panel: Acceleration Strategies for Multimedia 
All speakers above 

4:30 Panel: Opportunities for Future Microprocessors 

Moderator: Michael Slater 

Tom Beaver, Motorola 

Les Crudele, Compaq

John Mashey, Silicon Graphics 

Fred Pollack, Intel 

Andy Rappaport, August Capital 

Atiq Raza, AMD

5:30 Conference Adjourned 







SPEAKER BIOGRAPHIES 





Hisashige Ando 

is vice president of processor development of
HAL Computer Systems. He’s been with HAL
for 5 years and has been involved in the
development of three generations of SPARC V9
processors.



Joel Birnbaum 

is Hewlett-Packard’s senior vice president of 
research and development, and director of 
Hewlett-Packard Laboratories. He has 
supervised R&D teams responsible for RISC 
architecture at HP and IBM. 



John Arends’

area of expertise at Motorola has been RISC
microprocessor design and implementation,
including both 88000 and PowerPC. His
current focus is the development
of a new low-power, high-performance
microprocessor architecture for embedded
applications.



David Bondurant 

is director of marketing and applications for 
Enhanced Memory Systems. He is responsible 
for strategic and tactical marketing, 
applications engineering, marketing 
communications, and public relations. 



Douglas Beard 

is MX microprocessor design manager at 
Cyrix Corporation, which he joined in 1993. 
Previously he was with Supercomputer 
Systems as chief hardware architect and 
hardware design group manager, and with 
Gould Computer Systems Division as project 
manager. 



Peter Christy 

took over the reins as president of 
MicroDesign Resources in 1996. He directs all 
aspects of the company including operations, 
business planning, product development, 
marketing, and new business development. 

He also contributes on computer trends and 
marketing for Microprocessor Report. 



Thomas Beaver 

is corporate vice president and director of
marketing and sales with the networking and
computing systems group at Motorola’s
semiconductor products sector. In his 32-year
career with Motorola, he has held various
positions within the company.



Ronda Collier, 

engineering manager for the mobile products
group at S3, is responsible for mobile graphics
products from definition through
implementation. She also manages the design team
and interfaces with the marketing and
central engineering organizations.
Robert Colwell

manages the architecture team that is designing the
next-generation processor within the Microprocessor Development Division
at Intel’s Hillsboro, Oregon facility. He joined Intel in 1990 as a
senior architect on the P6 project, and became manager of the
architecture group two years later. In 1996 he was named an
Intel Fellow.



Rod Fleck 

is director of the 32-bit microcontroller group 
at Siemens Components, where he’s been an 
engineer and technical manager since 1984. 

He worked on the development team for
AMD’s 29000 architecture, defined Siemens’
16-bit architecture, and was director of
hardware/software coverification at Synopsys.


John Crawford 

is the director of microprocessor architecture
at Intel. He manages the development of
future microprocessor architectures, and the
development of the simulation tools
necessary to validate functional completeness and
performance.


Peter Glaskowsky 

is senior analyst of 3D and multimedia 
technology for MicroDesign Resources and is 
one of the industry’s leading graphics 
analysts. He came to MDR from Integrated 
Device Technology, where he was a chief 
engineer in the systems technology group. 




Lester Crudele 

is vice president and general manager of the
workstation division in the enterprise
computing group at Compaq Computer
Corporation. He is responsible for driving
the product development, strategy, and vision
for the workstation division, including
engineering, sales, and marketing.


Linley Gwennap 

is editor in chief and publisher of 
Microprocessor Report and director of product 
development for MicroDesign Resources. He 
joined MDR in 1992 after eight years at 
Hewlett-Packard working on RISC systems. 




Ian Devereux 

is the chief engineer of the ARM940T
development. He joined Advanced RISC Machines
in 1995 and has been a major contributor to
the development of the ARM8 and
ARM9TDMI processor families.


Glenn Henry 

has been president of Centaur Technology 
since April, 1995. Previously, he was a 
consultant to MIPS Technology (SGI) as well 
as the chief technology officer and senior vice 
president of the product group for Dell 
Computer Corporation. 





Greg Favor 

is a senior fellow at Advanced Micro Devices,
where he is responsible for future x86
processor architecture development efforts. Prior to
this, he was chief processor architect and
then director of K6 processor development.



Jerry Huck 

is the manager of processor architecture within
Hewlett-Packard’s computer organization.
His time is divided between managing a small
team responsible for processor simulators,
platform architecture definition, and
participating in the joint definition of processor
architecture with Intel.
Earl Killian

is director of architecture for MIPS 
Technologies. Previously, he cofounded 
Quantum Effect Design, participating in 
developing the R4600, R4650, and R4700 
MIPS RISC processors. 



John Mashey 

is director, systems technology in Silicon 
Graphics R&D. He’s an ancient UNIX person, 
having started work on it at Bell Labs in 1973. 
He has worked on and managed projects in 
both commercial and technical computing, 
helped design the MIPS architecture, and was 
also one of the founders of SPEC. 



Les Kohn 

is a C-Cube fellow and chief architect of the 
DVx product family. Prior to joining C-Cube, 
he was the chief architect for Sun’s 
UltraSPARC and Intel’s 860 processors. 



Jack Konrath 

is director of strategic marketing for memory 
products at Fujitsu Microelectronics. Most of 
his 28 years of experience have been in 
semiconductor memory-related positions, 
the rest in core logic and graphics. 



Gary Lauterbach 

is distinguished engineer and chief architect 
for the UltraSPARC-III microprocessor at
Sun Microsystems, and is responsible for 
research, design, and development of high- 
performance processors. 


Daniel Leibholz 

is a microprocessor architect in the semiconductor division of 
Digital Equipment Corporation, where he architected major 
sections of the Alpha 21264. He has interests in the areas of 
processor architecture, performance analysis, and system design. 



Robert Maher 

is vice president of engineering at Cyrix 
Corporation and is responsible for future 
processor designs. One of Cyrix’s initial 
design engineers, he was project director for 
the Cyrix 5x86 processor, and was responsible 
for development of the M2 processor. 


Harlan McGhan 

is the technical marketing group manager for the Sun
Microelectronics volume products division. In addition to
serving as a principal evangelist for Sun’s new line of JavaChips, he
focuses on high-level product planning, market requirements,
and new product definitions.

Dan O’Neill 

is director of National Semiconductor’s integrated processor
unit, which develops and markets x86-based
functionally integrated products. He started the unit in 1993.



Mark Papermaster 

is project manager for the POWER3 
microprocessor at IBM. He has worked for 15 
years in microelectronics and microprocessor 
engineering, including as manager of circuit 
and physical design for the PowerPC 601 
microprocessor. 



Gerald Pechanek, 

inventor of the ManArray, is chief technical 
officer and cofounder of BOPS. He has over 
25 years of development experience. 



Fred Pollack 

is the director of the measurement,
architecture, and planning group at Intel. He
directs the planning for Intel’s future
microprocessors. His group is responsible for all
Intel core platform architecture and
performance analysis.
Bill Queen

is a project manager for the
PA-8500 design in Hewlett-Packard's
Microprocessor Lab in Fort Collins, Colorado.
He previously worked on the PA-8000 and
PA-7100LC.


Michael Slater

is founder and principal analyst of
MicroDesign Resources, editorial
director of Microprocessor Report, and program
director for Microprocessor Forum. He
was previously an independent engineering
consultant and an R&D engineer at Hewlett-



Andy Rappaport 

is a general partner at August Capital. An 
oft-cited authority on semiconductor and 
hardware design technologies, he has written 
and lectured extensively on the changing 
structure of the semiconductor, computer, and 
telecommunications equipment industries. 



Peter Song 

is senior analyst for MicroDesign Resources 
and senior editor of Microprocessor Report. He 
contributes analysis of high-performance 
microprocessors. He joined MDR from 
Samsung, where he was a senior manager and 
principal engineer. 


S. Atiq Raza 

is senior vice president and chief technical 
officer, as well as a member of the board of 
directors for AMD. Prior to the merger of 
Advanced Micro Devices and NexGen, he was 
the president and chief executive officer of 
NexGen. 


Robert Stepanian 

is a senior computer architect with Digital 
Equipment Corporation, involved in the 
architecture definition of the StrongARM 
processor with the programmable media 
unit. 





Allen Roberts 

is vice president and general manager of the 
memory and technology division of Rambus. 
Prior to joining Rambus, he served as director 
of high-end engineering at MIPS. 



Farhad Tabrizi 

is chairman of the SLDRAM Consortium and 
director of strategic marketing at Hyundai 
Electronics. At Hyundai, he is responsible for 
setting strategic directions for future memory 
products. 



W.J. (Jerry) Sanders III 

is chairman of the board and chief executive 
officer of Advanced Micro Devices. He 
cofounded the company in 1969 with seven 
others and has been CEO since its inception. 



Nick Tredennick 

would like to be president of TechNerds
International. He has patents, publications,
experience, and the usual degrees, but is
having trouble connecting “does well” with “pays
well.”



Manfred Schlett 

joined hyperstone in 1993. He was
responsible for the DSP concept of the E1-32
RISC/DSP architecture and is now project
manager for the European consortium
EURICO developing the next hyperstone
RISC/DSP generation.




Jim Turley 

is senior analyst and senior editor of 
Microprocessor Report specializing in high- 
performance embedded microprocessors. He 
joined MDR in 1994 after devoting more 
than a dozen years to design engineering, 
engineering management, product marketing, 
and program management. 
Keynote:

A New World 
Order 
Alternative 
Microsoft 
Windows 
Platforms 


Jerry Sanders, AMD 
“And the platform will continue to
evolve from the connected PC of 
the mid-90s to the visual computing 
platform of the late-90s” 


Andrew S. Grove 
Chief Executive Officer 
Intel Corporation 
Remarks at Comdex 19 
133MHz AGP






33MHz PCI 







AMD-K6 3D Processor
0.25-micron process
81 mm² die size
9.3 M transistors
300 > 350 MHz

AMD-K6 MMX Enhanced Processor
0.25-micron process
68 mm² die size
8.8 M transistors
266 MHz
2H97

AMD-K6 MMX Enhanced Processor
0.35-micron process
162 mm² die size
8.8 M transistors
233 MHz
1H97
AMD-K6 3D Processor

AMD-K6 3D Processor
0.25-micron process
81 mm² die size
9.3 M transistors
300 > 350 MHz
1H98

AMD-K6 MMX Enhanced Processor
0.35-micron process
Driven by customer requirements

Clock speeds in excess of 500 MHz 

Advanced bus interface, “Alpha” EV6 bus protocol 

Plan of record: Slot “A” mechanically identical to Intel’s slot


Much Faster Performance 
More Compelling Features 


AMD-KS 12 23 MMX 
Enhanced Processor 


Pentium it 233 Processor 
w/ MMX Technology


2.5GB Hard Drive
16MB EDO-DRAM
High Performance
2MB Video Card
16x CD-ROM
33.6 Fax Modem 
16-Bit Audio 

Windows 95 * S/W Bundle 
AMD-K6™ Processor Product Roadmap


AMD-K6™ MMX™ Enhanced Processor 

Product Roadmap 
AMD-K6, AMD-3D, and Super7 are all registered trademarks of Advanced Micro Devices.
MMX is a trademark of the Intel Corporation.


Agenda 


• AMD Super7 Platform Initiative 

• Advancing the Socket 7 Platform 


• AMD-K6™ Product Roadmap 

• AMD-K6 3D 

• AMD-K6+ 3D 


• AMD-3D™ Technology 

• 3D Multimedia Instruction Set 



AMD ft). work 


Enhanced Processor Microprocessor Forum *97 
Super7: Socket 7 Enhancement Initiative

• Key Performance Enhancements 

• Addition of AGP in ‘97 

• Higher local bus frequency - 100 MHz 

• Support for 100 MHz frontside cache 

• Maintains socket 7 compatibility and cost advantages 

• At Leading Edge of all System Feature Advancements 

• Today: USB, SDRAM, UDMA, ACPI 

• Future: AGP, PC 98, 100 MHz bus, 100 MHz SDRAM, 1394, etc 

• System Logic Vendors: ALi, National, SiS, VIA and AMD 


AMD is committed to leading the socket 7 
infrastructure to higher performance! 
AMD-K6 Super7 Roadmap


“For a uniprocessor system, the Pentium® bus is just as good as - if
not better than - the P6 bus.”

Michael Slater, Microprocessor Report Dec 30, 1996 
1H’97: Socket 7, 66 MHz bus, Ultra DMA, SDRAM, PCI, ACPI, USB
2H’97: Adds AGP
1H’98: Adds 100 MHz L2 cache & local bus
2H’98: Adds full speed backside L2 cache, optional frontside L3 cache and 1394
AMD-K6 Processor Family Roadmap


400 


300 


200 


.35 µ process
MMX Enhanced
8.8M Transistors
162 mm²



AMD-K6 


AMD-K6 


AMD-K6 3D 


.25 µ process
Higher Speeds
Lower Power
8.8M Transistors
68 mm²
AMD-K6 3D™ New Processor Features


• AMD-3D Technology 

• Instruction set extensions to accelerate 3D graphics, audio and other 
multimedia applications 

• Superscalar MMX Units 

• Dual decode and dual execution pipelines 

• Maintains the K6 advantage of low execution latencies 

• No decode pairing restrictions 

• Only one cycle misalignment penalty on memory accesses 

• 100MHz Local Bus 

• Increases local bus and L2 cache bandwidth by 50% 

• Redesigned I/O timing to allow for low cost 100 MHz motherboard 


9.3 Million Transistors on a Die of 81 mm²
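
The 50% bandwidth figure above follows from raising the local bus clock from 66 MHz to 100 MHz. A minimal sketch of that arithmetic, assuming a 64-bit (8-byte) data bus with one transfer per clock and taking the nominal "66 MHz" as 66.67 MHz; the function name and both assumptions are illustrative and do not come from the slide:

```python
def peak_bandwidth_mb_s(bus_mhz: float, bus_bytes: int = 8) -> float:
    """Peak bandwidth in MB/s, assuming one transfer of bus_bytes per bus clock."""
    return bus_mhz * bus_bytes


# Assumption: the nominal "66 MHz" Socket 7 bus is treated as 200/3 MHz.
old = peak_bandwidth_mb_s(200 / 3)
new = peak_bandwidth_mb_s(100.0)

# Relative gain from the faster clock alone; bus width is unchanged.
increase_pct = (new - old) / old * 100

print(f"{old:.0f} MB/s -> {new:.0f} MB/s (+{increase_pct:.0f}%)")
```

Because the bus width stays the same in this sketch, the gain depends only on the clock ratio, which is why the same percentage would apply to a frontside L2 cache running at the bus clock.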
Level-One Cache
Controller 


Scheduler Buffer
(24 RISC86)

Instruction
Control Unit

Branch
Resolution Unit

Register Unit X
(Integer/MMX/3D)

Floating-Point
Unit

Register Unit Y
(Integer/MMX/3D)


Store 

Queue 


Acceleration of Multimedia Applications


Multimedia applications have grown to become an integral 
part of the PC platform 

• But multimedia algorithms are very computation intensive 


Application or 
Game Physics 


floating point 
intensive 


Geometry transform, 
Clipping, & Lighting 


floating point 
intensive 


Triangle
Set up


floating point & 
integer intensive 


Pixel 

Rendering 


integer 

intensive 
MMX extensions were added to accelerate integer multimedia algorithms,
but the impact on the user experience has been limited since MMX 
accelerated only some computations. 


AMD-K6 3D Block Diagram 




32KByte Level-One Instruction Cache 


64 Entry ITLB


20KByte Predecode Cache 


16 Byte Fetch 


Branch Logic 
(8192-Entry BHT) 
(16-Entry BTC) 
(16-Entry RAS) 


Dual Instruction Decoders 

x86 to RISC86


100 MHz
Super7 
Bus 

Interface 


Level-One Dual-Port Data Cache 

(32KByte) 


128 Entry DTLB 


AMD-3D Technology


Why a New Technology? 

• Generally only graphics pixel rendering has been accelerated by MMX and 3D 
graphics hardware; focus has been on integer performance 

• 3D graphics performance is now limited by the earlier floating point intensive 
stages of the graphics processing pipeline 

• Realistic physical modeling is also becoming a necessity 

What is it? 

• A new set of instructions to greatly accelerate floating point computations 

• Multiple floating point operations per clock 

• Defined and implemented in collaboration with leading ISV’s 

• Works in concert with graphics accelerator cards by speeding up the front end 
of the graphics pipeline 


AMD-3D Technology

Application or Game Physics: floating point intensive
Geometry transform, Clipping, & Lighting: floating point intensive
Triangle Set up: floating point & integer intensive
Pixel Rendering: integer intensive

AMD-3D Technology accelerates: Physics, Geometry, and Set up
Graphic cards accelerate: Pixel Rendering

Benefits 

• Relieves the floating point intensive bottlenecks in 3D graphics processing 

• Allows for more detailed physics-based modeling and simulations - more objects 
with accurate physical characteristics displayed at life-like speeds. 

• Accelerates most floating point intensive multimedia operations: 

• Graphics pipeline (Physics, Geometry, & Set up) 

• Audio processing (AC-3 and 3D) 


AMD-3D Technology

• SIMD Floating Point Instructions 

• Supports IEEE Single Precision Data Type 

• Two 32-bit FP values per 64-bit reg/mem operand 

• Uses MMX Registers 


• 24 New Instructions 

• PFMUL, PFADD, PFSUB, PFCMP, PF2I, PI2F, etc.

• Similar encoding format to MMX Instructions


• Streamlined for High Performance 

• Saturating arithmetic 

• No exceptions 

• Limited rounding modes 

• No switching overhead between MMX and AMD-3D instructions 

• Avoids x87 register stack
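
The packed data layout described above (two 32-bit single-precision values in one 64-bit operand, processed lane by lane) can be illustrated in Python. This is a sketch of the idea only: the helper names `pack2`, `unpack2`, and `lanewise` are hypothetical, `pfadd` and `pfmul` merely borrow the mnemonics listed on the slide, and the sketch does not model the saturation, rounding, or exception behavior of the real instructions.

```python
import struct


def pack2(lo: float, hi: float) -> int:
    """Pack two IEEE single-precision values into one 64-bit integer operand."""
    return int.from_bytes(struct.pack("<2f", lo, hi), "little")


def unpack2(reg: int) -> tuple:
    """Split a 64-bit operand back into its two single-precision lanes."""
    return struct.unpack("<2f", reg.to_bytes(8, "little"))


def lanewise(op, a: int, b: int) -> int:
    """Apply op to each 32-bit lane independently, as a SIMD instruction would."""
    (a0, a1), (b0, b1) = unpack2(a), unpack2(b)
    return pack2(op(a0, b0), op(a1, b1))


def pfadd(a: int, b: int) -> int:
    """Illustrative packed add: two additions from one operation."""
    return lanewise(lambda x, y: x + y, a, b)


def pfmul(a: int, b: int) -> int:
    """Illustrative packed multiply: two multiplications from one operation."""
    return lanewise(lambda x, y: x * y, a, b)


x = pack2(1.5, 2.0)
y = pack2(0.5, 4.0)
print(unpack2(pfadd(x, y)))  # lane 0 is 1.5 + 0.5, lane 1 is 2.0 + 4.0
print(unpack2(pfmul(x, y)))  # lane 0 is 1.5 * 0.5, lane 1 is 2.0 * 4.0
```

Each call produces two results from one operation, which is the sense in which the slide claims multiple floating point operations per clock.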


AMD-3D Technology: Software Support


Enthusiastic Support from Major ISV’s 
Full Software Development Support 

• Full Microsoft Support 

• Assembly language - native Microsoft MASM support 

• Fully optimized API and libraries at introduction: 
DirectX (Direct3D & DirectSound) and OpenGL 

• Profiler and optimizer tools 

• AMD SDK available to AMD NDA partners 

Dedicated AMD Development Support Group 
No Core OS Support Required 


AMD-K6™ Processor Family Roadmap
AMD-K6+ 3D
AMD-K6+ 3D Processor Features


• On Chip Full-speed Backside Level 2 Cache 

• Operates at 1X processor frequency

• 4-1-1-1 (-1-1-1-1) access timing (Peak Bandwidth 3.2 GB/sec at 400 MHz) 

• 256Kbyte 

• 2-way set associative 

• In addition to current 64 KB L1 cache

• Improved Write Buffering, Pipelining, and Combining 

• Large Level 2 TLB - Higher Performance Paging 

• Maintains Full Socket 7 Compatibility 

• 100 MHz frontside local bus 

• Optional very large frontside level 3 cache 

• 21.3 Million Transistors on a Die of 135 mm²
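
The quoted peak of 3.2 GB/sec at 400 MHz works out to 8 bytes per clock. The sketch below reproduces that figure and estimates what the 4-1-1-1 (-1-1-1-1) timing implies for bursts. It rests on assumptions not stated on the slide: an 8-byte transfer per clock, four transfers per burst, and a reading of "(-1-1-1-1)" as a back-to-back burst that follows with no new lead-off delay. All function names are hypothetical.

```python
def peak_gb_per_s(clock_mhz: float, bytes_per_clock: int = 8) -> float:
    """Peak bandwidth in GB/s when one transfer completes every clock."""
    return clock_mhz * bytes_per_clock / 1000.0


def burst_clocks(n_bursts: int, lead: int = 4, beats: int = 4) -> int:
    """Clocks for n back-to-back bursts: first is 4-1-1-1, later ones 1-1-1-1."""
    if n_bursts < 1:
        return 0
    first = lead + (beats - 1)          # 4 + 1 + 1 + 1 = 7 clocks
    return first + (n_bursts - 1) * beats


def sustained_gb_per_s(n_bursts: int, clock_mhz: float = 400.0,
                       bytes_per_clock: int = 8, beats: int = 4) -> float:
    """Average bandwidth over n pipelined bursts, including the lead-off delay."""
    clocks = burst_clocks(n_bursts, beats=beats)
    return n_bursts * beats * bytes_per_clock * clock_mhz / clocks / 1000.0


print(peak_gb_per_s(400.0))
for n in (1, 2, 100):
    print(n, burst_clocks(n), round(sustained_gb_per_s(n), 3))
```

Under these assumptions a single isolated burst reaches a little over half the peak rate, and the average approaches the peak only when many bursts are pipelined back to back.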
AMD-K6+ 3D Block Diagram

Summary


• Super7: 100 MHz frontside bus with AGP 

• Leading edge platform initiative that achieves Socket 7 architectural 
performance improvements 


• Robust AMD-K6 Product Roadmap for ‘98 

• Major CPU architectural performance improvements 

• Higher frequencies 

• 100% compatibility with Socket 7/Super7 infrastructure 

• On-chip level 2 cache 

• AMD-3D Technology: 

• New set of instruction extensions that greatly accelerate 3D graphics, audio and
other floating point intensive multimedia algorithms


AMD, Enabling a Higher Level of Application Realism. 
Evolution of IA-32 System Design

Market Forces Driving
Design Changes

• Exploding demand for high-performance mid-range 
and above servers and workstations based on 
standard platforms 

- CPU performance continues skyrocketing 

- L2 caches getting bigger at similar pace 

- Multiprocessing now needed and feasible 

• Large market segments justify differentiation 

- “Purpose-built” processor products 

- Platform specific product form factors following segments 

P6 core designed to serve multiple market segments 

CPU Trends

Future IA-32 CPUs
300MHz Pentium II
266MHz Pentium II
200MHz Pentium Pro
150MHz Pentium Pro
133MHz Pentium

Log scale, desktop workloads
Deschutes: next IA-32 CPU

- P6 at 333+ MHz in 1998 

Moore’s Law performance 
spiral continues 

Big caches and 
multiprocessing now 
standard in high end 
market segments 

Faster microarchitectures 
make more demands on 
cache & memory 


Industry Cache Trends



[Chart: estimated industry performance]


• Cache sizes double for 
each new process 

• Can combine multiple 
SRAMs for large caches 

- If product engineered for it! 

• Performance scaling not 
linear with size 

- Latencies not improving as 
fast as capacity 

- Still, larger means higher 
system performance on 
server workloads 
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Must build in headroom for future caches 
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Evolution of IA-32 System Design 






Design Problem: 

System Bottlenecks 


• New features, faster CPUs require faster rest-of- 
system (else bottlenecks arise) 

- MP, PCI, 1394, USB, hi-res graphics, real-time video 

• Backside L2 caches, bigger sizes, latencies, fast 
pipelining 

• Larger main memories 

• Need faster system buses 

- Especially with stream-oriented workloads 

- Good solution for Pentium® processor system bus (socket 7) 

- But unsuited for multiprocessor workstations & servers 

Intel. 






Design Problem: One CPU
Multiple Market

[Diagram: one P6 core serving desktop, mobile, and workstation/server (Socket 8) segments]

desktop

• Cost effective cartridge design for high volume manufacturing (Slot 1)

• Better management of electricals and thermals

• 1-2 way MP

workstation / server (Socket 8)

• Specialized form factor

• Dual Independent Bus

• Scalable, glueless MP

• Larger full speed L2 cache

• 1-4 way MP

Other diagram labels: P6 uniprocessor; architecture adds 10-20% system performance; performance loss for mobile
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Evolution of IA-32 System Design 


Design Response: Slot 1 

• High performance at volume price points 

- Still want/need performance boost from fast nearby L2 cache 

- Cartridge manufacturable at very large volumes 

- Cartridge permits close control of electricals and thermals 

• Commodity SRAMs help economics 


- Vs. Pentium® Pro processor’s custom SRAMs 

• Desktop DP-only implies smaller caches ok 

- Cartridge fixed in size, thermals, power supply to control cost 

The Pentium® II processor Slot 1 cartridge was created to get P6 microarchitecture to volume markets



Design Response: Slot 2 


• Large market segments emerged above desktop

- Require highest performance, higher cost ok 

- Can handle larger cartridges, thermals, power supply 

• 4 way multiprocessing needs bigger caches, higher 
system bus bandwidth 

- Custom SRAMs ok economically 

• Full speed caches need careful cartridge electricals 

- Manage longer latencies with latency hints, upstream 
caches, streaming buffers, compilers... 

- Beyond 200MHz: Source Synchronous signaling 

Slot 2 does not replace Slot 1, it complements Slot 1 

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Design Response: Source 
Synchronous Signalling 


[Diagram: Common clock technique: Processor and Cache exchange Data, both timed by a common bus clock]

[Diagram: Source synchronous technique: Processor and Cache exchange Data accompanied by a Strobe]
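The contrast between the two signalling techniques can be sketched as a first-order timing budget. This is an illustrative model only: the formulas are the usual textbook approximations, and every delay value below is a made-up assumption, not a figure from the presentation.

```python
# First-order timing budgets, in nanoseconds.
# All delay values are hypothetical placeholders chosen to show the trend.

def common_clock_max_mhz(t_clk_to_out, t_flight, t_setup, t_clock_skew):
    """Common clock: data must leave the driver, cross the bus, and meet
    setup at the receiver within one period of the shared clock."""
    min_period_ns = t_clk_to_out + t_flight + t_setup + t_clock_skew
    return 1000.0 / min_period_ns


def source_sync_max_mhz(t_setup, t_hold, t_data_strobe_skew):
    """Source synchronous: the strobe travels with the data, so flight time
    drops out and only the data-to-strobe mismatch limits the bit time."""
    min_bit_time_ns = t_setup + t_hold + 2 * t_data_strobe_skew
    return 1000.0 / min_bit_time_ns


cc = common_clock_max_mhz(t_clk_to_out=2.0, t_flight=1.5, t_setup=1.0, t_clock_skew=0.5)
ss = source_sync_max_mhz(t_setup=0.5, t_hold=0.5, t_data_strobe_skew=0.25)
print(f"common clock limit:       {cc:.0f} MHz")
print(f"source synchronous limit: {ss:.0f} MHz")
```

With these assumed delays the common clock budget is dominated by flight time and clock-to-output delay, which is the term a strobe sent alongside the data removes.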


Microprocessor Forum 


Page 5 


October 14-15, 1997 








Conclusions 


P6 microarchitecture enables specialized 
products for each market segment 

- One CPU core, multiple cache/bus solutions 

- Complementary two-slot strategy balances design 
requirements of fast cache buses with market-specific 
cost constraints 


Strategy: Compelling processor products to all 
market segments 

- Intel IA-32 continues focus on all but highest end
segments
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IDT WinChip C6+: 

The Next Generation 


WinChip, C6 & IDT-C6, C6+ are trademarks of Integrated Device Technology Inc. 



WinChip Background 

■ Started 4/95 (4 in my kitchen, home PC, etc) 


■ Designed IDT-C6 Processor (Ann’d 5/97) 

• P55-compatible (Includes MMX™) 

• Optimized for business applications 

• Smallest 0.35µ die (88 mm²), lowest cost

• 8.9 W max power at 200 MHz (3.3V) 

■ Started Production Shipments In Sept 


• Windows certified (“Designed for Microsoft Windows 95”), XXCAL Platinum certified

• 4 BIOSes available, 10’s qualified boards 

• Targeting tier 3 sub-$1000 PCs 

• Primarily via distribution (HHT, Wyle) 

■ 180 ($90) & 200 MHz ($135) Shipping Now 

• 225 & 240 Sample In November

MMX is a trademark of Intel Corporation
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Next Generation Goals (C6+ “) 

■ “2x” MMX & FP CPI Improvement 

• For those who care (not everyone) 

• Primary FP focus is 3D graphics 

■ Some Winstone CPI Improvement 

• Already very competitive 

■ Same Small Die Size (at same 0.35µ)

• Exploit our small-size design & methodology 

■ Support New Technologies 

• 2.5V transistor & 0.25µ shrink

• Faster MHz, lower cost, less power 

■ Ship 6 Months After C6 

• Exploit our fast development cycle 
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IDT WinChip C6+ 


IDT C6+ MMX Improvements 

■ Full MMX Pairing (Superscalar) 

• Dual MMX units ala P55 (1 multiplier, 1 shifter) 

• 2 Instructions decoded, issued & executed per clock 

• Same pairing rules as P55 

■ Faster MMX Instruction Timing Than P55 

• 1 cycle multiply latency (vs 3) ! 

• 1 cycle multiply-add latency (vs 3) ! 

• 1 cycle store (vs 2) 

■ No MMX-Integer Pairing 

• But, larger caches, larger TLBs, etc. 

■ Result: Better MMX Performance Than P55 

• Even on biased Intel Media Benchmark! 


IDT C6+ FP Improvements 

■ Fully Pipelined (Complete Redesign Same Size) 

■ Some Instructions Slower Than P55 

• FMUL → 4/2 or 3/1 (SP) (vs 3/1)

• FXCHG → 1 (vs 1/0)

• 1 clock penalty on register-memory forms

■ Some Instructions Faster Than P55

• FST → 1 (vs 2-3), FIST, FSTSW AX-SAHF, etc.

■ Results: 80-100% P55 FP Performance 

• 80% on biased Intel Media Benchmark (FP) 

• Typically same as or faster than P55 on real FP applications

■ We Intentionally Stopped Here! 

• Silicon better spent on 3D assists/Winstone/etc. 







3D Graphics Improvements 

■ 53 New x86 Instructions (12 x86 opcodes) 

• Designed to speed coord transforms/lighting 

• Addresses 20-50% CPU time in heavy-duty 3D 

■ 22 Additional FP Registers 

■ Minimal Die Size Impact (< 1 mm²)

• Primarily reuse of existing dataflow 

• New instruction decodes, a few muxes, etc. 

■ Intended For Distribution In Future MS D3D 

• Working with Microsoft to add our code 

• Automatically selected machine-specific code 

• Transparent to applications & users 




3D Graphics Improvements 

■ Base Instruction Functions (Hdw) 

• 22 additional FP registers (30 total) 

• 1-clock multiply-accumulate

• 1-clock load of 2 single-precision (SP) values

• 1-clock compare & set bit flags

• Fast SP inverse square root, square root, divide

• 1-clock convert to/from integer

• Fast & flexible moves to/from integer registers 

• Individual instruction control of precision 

■ Microcode Instructions For Future Speedups 

• Multiple clocks initially, single clock in future

• 2x & 4x multiply-accumulates 

• Store 2 single-precision values at once 
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3D Graphics Improvements 


Results (SP data), in clks

                                    P55 (note 1)    C6+ (note 2)
■ [x y z] Transform (12 values)     34              14
■ 1/sqrt(x² + y² + z²)              125 (hw)        29
                                    ≈80 (sw)
■ [x y z] Edge Detect (6)           37              8
Total                               196             51

etc.


Notes 


1. Highly optimized low-level assembler code for P55. 
Much better than any real code that we have found in 
programs/benchmarks. 

2. Highly optimized low-level assembler code for IDT C6+. 
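As a rough illustration of what the first two kernels compute and how the clock counts compare, the sketch below uses the figures from the Results slide, taking the hardware square-root number for the P55. The function names and the row-vector layout of the 12-value transform are assumptions made for this example; the slides do not show the actual code.

```python
import math

# Clock counts as listed on the Results slide (P55 hardware-sqrt figure used).
P55_CLKS = {"transform": 34, "inv_sqrt_norm": 125, "edge_detect": 37}
C6P_CLKS = {"transform": 14, "inv_sqrt_norm": 29, "edge_detect": 8}


def transform(m, v):
    """[x y z] transform with 12 values: rows 0-2 of m form a 3x3 matrix and
    row 3 is a translation (layout assumed for this sketch)."""
    x, y, z = v
    return tuple(x * m[0][c] + y * m[1][c] + z * m[2][c] + m[3][c] for c in range(3))


def inv_length(v):
    """1/sqrt(x^2 + y^2 + z^2), the factor used to normalize a vector."""
    return 1.0 / math.sqrt(sum(c * c for c in v))


speedups = {k: P55_CLKS[k] / C6P_CLKS[k] for k in P55_CLKS}
total_speedup = sum(P55_CLKS.values()) / sum(C6P_CLKS.values())

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
print(f"total: {total_speedup:.2f}x")
```

The per-kernel clock counts sum to the totals given on the slide (196 and 51).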


IDT C6+ Integer Improvements 

■ Many Misc Improvements 

• Multiply → 6 (vs 10 for P55)

• Load-ALU-Store → 2 (vs 3 for P55)

• No penalty for 0F prefixes

• Reduced AGIs, no penalty on base+index, etc. 

■ Limited Instruction Pairing 

• PUSH-PUSH, POP-POP 

• Pairs executed in same clock 

■ Generate Up to 4 Micro Instructions per Clock 

• Helps fill queue faster, key to branch prediction

■ Great Branch Prediction 

• Better performance & much smaller than P55 
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IDT C6+ Branch Prediction 




• 100% Address Prediction on 99% of Branches

• Predicted Taken

- 2 clks if Q empty

- else 1 clk (85%)

• Predicted Not Taken

- 1 clock

[Diagram: the IP and a 12-bit history (Br Pattern) index a "GSHARE" Branch History Table (4K x 1 entries); the Agree/disagree bit from the BHT is combined with a “Smart” Default Prediction to give T/NT, which steers the Queue feeding the Execution Pipeline]

Highly Accurate “Two Level” Prediction
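The scheme in the diagram resembles a gshare predictor whose table stores agree/disagree bits rather than taken/not-taken bits. The following is a minimal sketch under several assumptions: the index is the IP XORed with the 12-bit history, each of the 4K entries is a single bit, and the "smart" default is modeled as a hypothetical backward-taken / forward-not-taken rule. The slides do not specify any of these details.

```python
class GshareAgreePredictor:
    HISTORY_BITS = 12
    TABLE_SIZE = 1 << HISTORY_BITS  # 4K one-bit entries

    def __init__(self):
        self.history = 0
        # Every entry starts out agreeing with the default prediction.
        self.agree = [1] * self.TABLE_SIZE

    @staticmethod
    def default_prediction(ip, target):
        # Hypothetical static rule: backward branches are predicted taken.
        return target < ip

    def _index(self, ip):
        return (ip ^ self.history) & (self.TABLE_SIZE - 1)

    def predict(self, ip, target):
        default = self.default_prediction(ip, target)
        return default if self.agree[self._index(ip)] else not default

    def update(self, ip, target, taken):
        default = self.default_prediction(ip, target)
        # Record whether the outcome agreed with the default, then shift
        # the outcome into the global history.
        self.agree[self._index(ip)] = 1 if taken == default else 0
        self.history = ((self.history << 1) | int(taken)) & (self.TABLE_SIZE - 1)


# A forward branch that alternates taken / not taken: the static default
# alone would be wrong half the time, but the history-indexed table learns it.
predictor = GshareAgreePredictor()
misses = []
for n in range(200):
    taken = (n % 2 == 0)
    misses.append(predictor.predict(0x100, 0x200) != taken)
    predictor.update(0x100, 0x200, taken)
print("mispredicts in first 20:", sum(misses[:20]))
print("mispredicts in last 100:", sum(misses[100:]))
```

Storing agreement instead of direction means two branches that alias to the same entry tend to reinforce each other when both follow their defaults, which is one way a small table can stay accurate.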




IDT C6+ Branch Prediction 



                                     P55        C6
BTB/BHT Bits                         34 Kb      4 Kb
Correctly Predicted Branch (clks)    1          1-2 (1.1 avg)
Mispredict Time (clks)               5-6        4
Predict Rates
 - Norton SI32                       77%        89%
 - Winstone Business                 82%        93%
Avg Branch Clocks
 - Norton SI32                       2.03       1.41
 - Winstone Business                 1.81       1.30
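The average branch clock figures are consistent with a simple expected-cost model: predict rate times the correctly predicted cost plus miss rate times the mispredict cost. The sketch below reads the slide's columns as P55 first and C6 second, and takes 5.5 clocks as the midpoint of the P55's 5-6 clock mispredict range; both readings are assumptions.

```python
def avg_branch_clocks(predict_rate, correct_clks, mispredict_clks):
    """Expected clocks per branch for a given prediction rate."""
    return predict_rate * correct_clks + (1.0 - predict_rate) * mispredict_clks


P55 = {"correct": 1.0, "mispredict": 5.5}  # 5.5 = assumed midpoint of 5-6
C6 = {"correct": 1.1, "mispredict": 4.0}   # 1.1 = stated average of 1-2

# (P55 predict rate, C6 predict rate) per benchmark, from the slide.
rates = {"Norton SI32": (0.77, 0.89), "Winstone Business": (0.82, 0.93)}

for bench, (p55_rate, c6_rate) in rates.items():
    p55 = avg_branch_clocks(p55_rate, P55["correct"], P55["mispredict"])
    c6 = avg_branch_clocks(c6_rate, C6["correct"], C6["mispredict"])
    print(f"{bench}: P55 {p55:.2f} clks, C6 {c6:.2f} clks")
```

Under these assumptions the model lands within about 0.01 clock of each of the four figures on the slide.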





IDT C6+ Summary 


■ 2x C6 MMX CPI Performance 

• Better Than P55 MMX (on its own benchmark)! 

■ 2x C6 FP CPI Performance 

• Worst-case 80% of P55 (biased Intel Media BM) 

■ 2-4x P55 CPI on Specific 3D Graphics Kernels 

• Will be transparently available via MS D3D 

■ 6% Winstone CPI Improvement 

• As good as any other socket 7 at same MHz 

■ Support For New Technology (Split Voltage) 

■ 91 mm² vs. 88 for Current C6 (both 0.35µ)



IDT C6+ Summary 

■ Production Tapeout 11/15 

■ Target Shipments 2Q98 

• Samples 1Q98 

■ Two Technology Options 

• 3.3V core 

- Up to 266 MHz 

- Split V core allows reduced power & MHz 

• 2.5V core 

- 300 MHz at first ship 

- 40% lower power (at same MHz) 

- 3 months after 3.3V version 
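The quoted power saving for the 2.5V core is roughly what the standard dynamic-power approximation would predict. The sketch below assumes power scales with voltage squared times frequency and ignores leakage and I/O power; that model is an assumption for illustration, not something stated on the slide.

```python
def relative_dynamic_power(v_new, v_old, f_new=1.0, f_old=1.0):
    """Ratio of dynamic power under the P ~ C * V^2 * f approximation,
    with switched capacitance assumed unchanged."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Same clock frequency, core voltage lowered from 3.3 V to 2.5 V.
reduction = 1.0 - relative_dynamic_power(2.5, 3.3)
print(f"estimated power reduction at same MHz: {reduction:.0%}")
```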
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Competitive Strategies for x86 Microprocessors 


Competitive Strategies for 
x86 Microprocessors 

Michael Slater 
Principal Analyst 
MDR 


Changes in the Past Year 

♦ Rapid shift to processors with MMX 

• Pentium → Pentium/MMX

• Pentium Pro → Pentium II

• K5 → K6

• 6x86 → 6x86MX

♦ Intel’s Pentium II introduces Slot 1 

♦ 3D generates broad interest in FP 

♦ Cyrix acquired by National 

♦ IDT/Centaur joins the fray 
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Competitive Strategies for x86 Microprocessors 


MMX Takes Over 

♦ Rapid shift to MMX processors driven 
more by other features of those 
processors than by MMX itself 

• Larger caches, faster clock speeds 

♦ Marketing has convinced users to 
want MMX, but applications are few 

• MMX has become a standard part of 
the architecture, making it largely a 
non-issue 


Growth in the Low End 

♦ Fastest growth in low-cost PCs 

♦ Cyrix’s MediaGX won role at Compaq 
by offering price advantage 

• First successful high-integration PC 
processor 

♦ Intel’s focus on higher price points
makes this segment an opportunity for
Intel’s competitors

• But it is hard to make much money 
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System Interface Trends 


66-MHz Socket 7 → 100-MHz Socket 7 → Socket 7 w/On-Chip L2

Pentium Pro → 66-MHz Slot 1 → 100-MHz Slot 1

    ↘ 100-MHz Slot 2


Questions for 1998

♦ Will Intel’s competitors be able to match
Intel’s FP/MMX performance?

♦ Will Intel succeed in moving the
mainstream market to Slot 1?

♦ Can Intel’s competitors keep Socket 7 
performance competitive? 

♦ Will proprietary instruction set extensions 
succeed? 

♦ Is there a role for integrated processors? 
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Beyond the 6x86MX™ Processor 


Robert Maher 

Vice President of Engineering 
Cyrix Corporation 
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7th Generation 


Enhanced FPU /MMX™ 


100 MHz Socket 7 

AGP 

SDRAM 


1997 1998 1999
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Beyond the 6x86MX™ Processor 


Socket 7 Today 


[Chart: Winstone 97® / Windows® 95, Business Applications; x-axis: MHz / PR at 200, 233, 266, 300]



Socket 7: The Future - 6x86MX™ 


♦ AGP support in Q1 1998 

♦ 100 MHz system bus Q1 1998 

> Lookaside caches: Up to 2MB support 

- Tag RAM in north bridge 

- 5ns data RAMS 

- 3-1-1-1 line fills

♦ SDRAM today, DDR SDRAM 2H 1998 

♦ Firewire (IEEE 1394), Device Bay, ATA66 
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Beyond the 6x86MX™ Processor 


The Socket 7 System 


♦ Balanced system design 

> Concurrency 

- CPU <— > L2 

- PCI <— > MEM 

- AGP <— > MEM 

> Concurrency 

- CPU <— > MEM 

- CPU <— > AGP 


♦ Multimedia applications 
have data sets > 2MB 
therefore latency to main 
memory is critical 



[Diagram: Socket 7 processor with L2 Cache and AGP 3D Graphics]

Socket 7 System Performance 


> As core frequency improves, L2 cache at
100 MHz keeps L1 miss penalty (core clocks)
consistent with 66 MHz systems

> 100 MHz system bus: 50% increase in L2
cache performance

> Low latency / High bandwidth path to main 
memory 

- 100 MHz Socket 7 bus 


- 100 MHz SDRAM 

Performance will scale with 100 MHz system bus 
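The claim about the L1 miss penalty can be checked with simple arithmetic. The sketch below assumes the 3-1-1-1 line fill quoted earlier for the lookaside cache and pairs a 200 MHz core with the 66 MHz bus and a 300 MHz core with the 100 MHz bus; those core frequencies are hypothetical choices for illustration.

```python
def l1_miss_penalty_core_clocks(core_mhz, bus_mhz, burst=(3, 1, 1, 1)):
    """Core clocks spent on one L2 line fill that runs at bus speed."""
    bus_clocks = sum(burst)  # 3-1-1-1 line fill = 6 bus clocks
    return bus_clocks * core_mhz / bus_mhz


BUS_66 = 200.0 / 3.0  # the nominal "66 MHz" bus
BUS_100 = 100.0

print(l1_miss_penalty_core_clocks(200.0, BUS_66))   # 200 MHz core, 66 MHz bus
print(l1_miss_penalty_core_clocks(300.0, BUS_100))  # 300 MHz core, 100 MHz bus
print(l1_miss_penalty_core_clocks(300.0, BUS_66))   # 300 MHz core left on 66 MHz
```

With the same core-to-bus multiplier the penalty in core clocks is unchanged, while a faster core left on the 66 MHz bus pays 50% more core clocks per fill, matching the 100/66 bus ratio.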





Beyond 6x86MX™: Cayenne Core 

♦ Based on 6x86MX™ core 

> 4MB paging 

> Virtual mode enhancements 

> Frequency optimizations 

♦ 64KByte L1 cache

♦ Pipelined, dual-issue FPU 

♦ Enhanced MMX technology 

♦ .25 micron process technology 



Cayenne Core: FPU 



♦ Dual-issue floating point 


> 2 FOP 


> 1 load / 1 store, 1 FOP 

> Instruction queue 


♦ Fully pipelined floating point unit 


Throughput / Latency    Cayenne    6x86MX
FXCHG                   0          2/2
LOAD/STORE              1/4        4/4
ADD                     1/4        4/4
MULTIPLY (SP)           1/4        4/4
MULTIPLY (DP)           3/6        6/6
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Cayenne Core: MMX 

♦ Dual Issue / Dual Execute 

> 1 shift/ 1 multiply 

> 2 MMX add/logical units 

> 1 load / 1 store 

♦ Fully pipelined 

♦ Single-cycle execution 

♦ Multiplies execute with 1 cycle 
throughput, 2 cycle latency 

♦ Single-cycle MMX/FP context switch 

Cyrix 


Cayenne Core: MMXFP 


♦ Enhanced MMX execution unit 

> Additional data types 

— IEEE 754 single-precision floating point 

> Two single-precision floating point results per 
operation 

> Dual execute ⇒ 4 FLOPS per clock cycle

— 1 FP add unit / 1 FP multiply unit 


> 1 GFLOP peak performance 
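The peak figure follows from the per-clock arithmetic. In the sketch below, the clock frequency is not taken from the slides; 250 MHz is only what a 1 GFLOP peak would imply at 4 FLOPS per clock.

```python
RESULTS_PER_OP = 2  # two single-precision results per operation
UNITS = 2           # one FP add unit plus one FP multiply unit (dual execute)

flops_per_clock = RESULTS_PER_OP * UNITS


def peak_gflops(clock_mhz):
    """Peak single-precision GFLOPS at a given core clock."""
    return flops_per_clock * clock_mhz / 1000.0


# Core clock implied by a 1 GFLOP peak (an inference, not a stated spec).
implied_clock_mhz = 1000.0 / flops_per_clock
print(flops_per_clock, implied_clock_mhz, peak_gflops(implied_clock_mhz))
```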
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Beyond the 6x86MX™ Processor 


MMX and FPU Block Diagram 



[Block diagram labels: Integer Unit / Instructions; Cache Unit / Data]




MMXFP Instruction Extensions 

♦ Add, Sub, Multiply, Convert, and Compare 
operations which support FP data types 

♦ Scatter/Gather operations for vectorized 
floating point 

> Gather and scatter triangle vertices for 
optimum parallelism 

♦ Reciprocal and Reciprocal Square Root 

♦ Motion Estimation Instruction 
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Beyond the 6x86MX™ Processor 


MMXFP: Software Support and 
Execution 



♦ Work with Microsoft to support MMXFP in 
retained mode Direct3D driver 


♦ Assist 3rd-party software development: i.e., 
game developers: immediate mode drivers 


Throughput / Latency    MMX      Equiv x86 FP
Load/Store/Convert      1/1      2/4
Add/Multiply            1/3      2/4
Reciprocal              3/5      48+/48+
Root Reciprocal         3/5      140+/140+


Cyrix 


MMXFP: Geometry Transforms 



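\[
\begin{bmatrix} x' & y' & z' & w \end{bmatrix}
=
\begin{bmatrix} x & y & z & 1 \end{bmatrix}
T_{geom},
\qquad
T_{geom} =
\begin{bmatrix}
a_{00} & a_{01} & a_{02} & 0 \\
a_{10} & a_{11} & a_{12} & 0 \\
a_{20} & a_{21} & a_{22} & 0 \\
t_x & t_y & t_z & 1
\end{bmatrix}
\]

\[
T =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix},
\qquad
T_{pers} =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 1/d \\
0 & 0 & 0 & 0
\end{bmatrix}
\]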


♦ 4 x 4 matrix multiply reduces to 3x3 matrix multiply 
with a 3x1 vector add 

♦ Translates 2 vertices at a time in 21 clocks 

♦ 10.5 clocks/vertex vs. 36 clocks/vertex with standard 
x86 code 
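The reduction from a 4 x 4 multiply to a 3 x 3 multiply plus a vector add works because the last column of the geometry matrix is fixed at (0, 0, 0, 1). The sketch below checks that both forms give the same result; the function names and test values are made up for illustration, and this is plain Python rather than MMXFP code.

```python
def transform_4x4(v, a, t):
    """Full homogeneous form: [x y z 1] times the 4x4 geometry matrix."""
    row = (v[0], v[1], v[2], 1.0)
    m = [a[0] + [0.0], a[1] + [0.0], a[2] + [0.0], list(t) + [1.0]]
    return [sum(row[k] * m[k][c] for k in range(4)) for c in range(4)]


def transform_3x3_plus_add(v, a, t):
    """Reduced form: 3x3 matrix multiply followed by a 3x1 vector add."""
    return [sum(v[k] * a[k][c] for k in range(3)) + t[c] for c in range(3)]


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
t = [10.0, 20.0, 30.0]
v = [1.0, 1.0, 1.0]

full = transform_4x4(v, a, t)
reduced = transform_3x3_plus_add(v, a, t)
print(full, reduced)

# Clock figures from the slide: 2 vertices in 21 clocks vs 36 clocks/vertex.
clocks_per_vertex = 21 / 2
speedup = 36 / clocks_per_vertex
print(clocks_per_vertex, round(speedup, 2))
```

The reduced form needs 9 multiplies instead of 16, and w is always 1, so it never has to be computed.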
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Beyond the 6x86MX rM Processor 


MMXFP: Lighting Calculation 



- I_l is the intensity at the light source

- k_d is a coefficient of diffuse lighting

- L is direction vector

- N is surface normal

\[ I_d = I_l \, k_d \, (L \cdot N) \]


♦ Processes 2 vertices in 23 clocks 

♦ 11.5 clocks/vertex vs. 129 clocks with standard 
x86 code (70 clock square root) 
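A plain-Python sketch of the diffuse lighting calculation is shown below. It assumes the direction and normal vectors are normalized as part of the calculation, which is where a reciprocal square root would be used; the slide mentions the square-root cost but does not show the code, so the structure here is an assumption.

```python
import math


def normalize(v):
    """Scale a vector to unit length using a reciprocal square root."""
    inv_len = 1.0 / math.sqrt(sum(c * c for c in v))
    return [c * inv_len for c in v]


def diffuse_intensity(light_intensity, k_d, light_dir, normal):
    """Light intensity times diffuse coefficient times (L dot N)."""
    l_unit = normalize(light_dir)
    n_unit = normalize(normal)
    return light_intensity * k_d * sum(a * b for a, b in zip(l_unit, n_unit))


print(diffuse_intensity(1.0, 0.5, [0.0, 0.0, 2.0], [0.0, 0.0, 1.0]))
print(diffuse_intensity(1.0, 0.5, [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))

# Clock figures from the slide: 2 vertices in 23 clocks vs 129 clocks.
clocks_per_vertex = 23 / 2
print(clocks_per_vertex, round(129 / clocks_per_vertex, 1))
```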



>10 million meshed triangles / sec peak performance (Geometry + Lighting)



Summary 



♦ 6x86MX™ to support 100 MHz socket 7 bus 
with state-of-the-art system features: 

AGP, SDRAM, 1394, Device Bay, ATA66 

♦ Cayenne Core: 

> Dual-issue, pipelined FPU/MMX to enable highest 
performance 3D graphics, DVD, and 3D audio 

> MMXFP floating point extensions 

> PR300 to PR400 

> 65 sq mm in .25 um, 5 layer metal process 

> 2H98 production 
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Next Generation instruction Set Architecture 



Next Generation Instruction Set Architecture


John Crawford, Intel Fellow 
Director, Microprocessor Architecture 
Intel Corporation 


Jerry Huck 

Manager and Lead Architect 
Hewlett-Packard Company 
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PACKARD 



Objectives 

▲ Unveil the technology behind the next 
generation ISA 

• Today’s focus on architecture, not implementation

▲ Context 

• History 

• Motivation 

▲ ISA Preview 

• A few key features 

• Benefits 
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Intel and HP Technology Alliance 

▲ Intel 

• Microprocessor / platform technology 

• 64-bit architecture definition 

▲ HP

• Enterprise systems technology expertise 

• Architecture research advancements 

▲ Jointly defined next generation 64-bit 
instruction set 

• Instruction set specification 

• Compiler optimization 

• Performance simulation and projection 




Instruction Set Architecture (ISA) 
Objectives 

▲ Enable industry leading system performance

• Breakthrough performance 

• Headroom 

▲ Enable compatibility with today’s IA-32 
software & PA-RISC software 

▲ Allow scalability over a wide range of 
implementations 

▲ Full 64-bit computing 
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Performance 


Current State of The Art

• H/W detects implicit parallelism
• H/W O-O-O scheduling & speculation
• H/W renames 8-32 registers to 64+

• Simple, fixed length instructions
• Sequencing done by compiler

• Complex, variable length instructions
• Sequencing done in hardware

Time

What’s next, beyond traditional architectures?
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Current Performance Limiters: 



▲ Mispredicts limit performance
▲ Small blocks restrict code scheduling freedom

• Fragmentation

• Poor utilization of wide machines

[Figure: THEN block with unused execution slots]
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Current Performance 


Limiters:


▲ Memory latency increasing relative to processor 
speed 

▲ Load delay compounded by machine width 

• Latency hiding requires more parallelism in a wide machine 


[Figure: load latency on Scalar vs. 4-way Superscalar machines; a branch is a barrier]
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Current Performance Limiters 
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a Sequential execution model 


[Figure: original source code → sequential machine code → hardware]



▲ Compiler has limited, indirect view of hardware
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Better Strategy : Explicit Parallelism 

Compiler exposes, enhances, and exploits parallelism in the 
source program and makes it explicit in the machine code. 


[Figure: original source code → compiler (Expose, Exploit) → parallel machine code]
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Next Generation Architecture Technology 
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000 Superscalar 


Explicitly Parallel Instruction Computing

• H/W detects implicit parallelism
• H/W O-O-O scheduling & speculation
• H/W renames 8-32 registers to 64+

• Simple, fixed length instructions
• Sequencing done by compiler

• Complex, variable length instructions
• Sequencing done in hardware

Time
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Next Generation Terminology 

▲ EPIC is the next generation technology 

• e.g., RISC, CISC 


▲ IA-64 is the architecture that incorporates 
EPIC Technology 

• e.g., IA-32, PA-RISC 


▲ Merced™ processor is the first IA-64 based 
implementation 

• e.g., Pentium II processor, PA-8500 
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Key 64-bit ISA Features within IA-64 

▲ Architecture Resources 
▲ Instruction Format
▲ Predication
▲ Speculation
▲ (Branch Architecture)
▲ (Floating-Point Architecture)
▲ (Multimedia Architecture)
▲ (Memory Management & Protection)
▲ (Compatibility)
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Architecture Resources Provide for 
Parallel Execution & Scalability



▲ Massively resourced - large register files

• Traditional architectures are forced to rename registers
▲ Inherently scalable - replicated function units
▲ Explicitly parallel - transistors used more effectively



Instruction Format: Explicit Parallelism 

▲ Breaking the sequential execution paradigm

• Explicit instruction dependency: template

• Flexibly groups any number of independent instructions

▲ Explicitly scheduled parallelism

• Enables compiler to create greater parallelism

• Simplifies hardware by removing dynamic mechanisms

• Fully interlocked: hardware provides compatibility

[Figure: 128-bit bundle holding instruction 2, instruction 1, instruction 0]

▲ Modest code size expansion


The new instruction format enables scalability w/ compatibility 
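A 128-bit bundle with three instructions and a template can be sketched as a packed integer. The field widths below (a 5-bit template in the low bits and three 41-bit instruction slots) come from the IA-64 documentation published after this talk, not from the slide, and the function names are hypothetical.

```python
TEMPLATE_BITS = 5   # width from later IA-64 documentation, not from the slide
SLOT_BITS = 41      # likewise
SLOTS = 3


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template out of range")
    if len(slots) != SLOTS:
        raise ValueError("a bundle holds exactly three instructions")
    word = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction slot out of range")
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (word >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(SLOTS)
    ]
    return template, slots


bundle = pack_bundle(0x10, [1, 2, (1 << SLOT_BITS) - 1])
print(hex(bundle))
print(unpack_bundle(bundle))
```

Because the template travels with the instructions, the hardware can read the grouping information directly instead of rediscovering dependencies at run time.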
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Branches Limit Performance 



Traditional Architectures: 4 basic blocks 
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Predication 



[Figure: if / then branch structure]
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Predication Enhances Parallelism 


Traditional Architectures: 4 basic blocks    EPIC Architectures: 1 basic block


Predication: Features and Benefits 

▲ Compiler given larger scheduling scope 

• Nearly all instructions can be predicated 

• State updated if an instruction’s predicate is true, otherwise 
acts as a NOP 

• Compiler assigns predicates, compare instructions set them 

• Architecture provides 64 1-bit predicate registers (PR) 

▲ Predicated execution removes branches 

• Convert a control dependence to a data dependence 

• Reduce mispredict penalties 

▲ Parallel execution through larger basic
blocks 

• Effective use of parallel hardware 
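The effect of predication can be illustrated with a tiny interpreter in which every instruction carries a qualifying predicate. This is a toy model under stated assumptions: the instruction tuples, the `cmp_eq` and `mov` operations, and the convention that predicate 0 is always true are invented for the sketch and are not the IA-64 encoding.

```python
def run_predicated(program, regs):
    """Execute (predicate, op, *args) tuples; an instruction whose
    predicate is false leaves all state untouched, acting as a NOP."""
    preds = [False] * 64  # 64 one-bit predicate registers
    preds[0] = True       # sketch convention: predicate 0 is always true
    for qp, op, *args in program:
        if not preds[qp]:
            continue
        if op == "cmp_eq":
            # A compare sets one predicate and its complement.
            p_true, p_false, a, b = args
            preds[p_true] = regs[a] == regs[b]
            preds[p_false] = not preds[p_true]
        elif op == "mov":
            dst, value = args
            regs[dst] = value
    return regs


# if (a == b) x = 1; else x = 2;  expressed with no branch at all.
program = [
    (0, "cmp_eq", 1, 2, "a", "b"),
    (1, "mov", "x", 1),  # then-side, guarded by predicate 1
    (2, "mov", "x", 2),  # else-side, guarded by predicate 2
]

print(run_predicated(program, {"a": 3, "b": 3, "x": 0}))
print(run_predicated(program, {"a": 3, "b": 4, "x": 0}))
```

The two guarded moves have no control dependence on each other, so a compiler is free to schedule them side by side in one basic block.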

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Predication increases Performance 
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Source: iSCA ‘95 S.Mahike, et.aL 
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Memory Latency Causes Delays 

▲ Loads significantly affect performance 

• Often first instruction in dependency chain of instructions 

• Can incur high latencies 


Traditional Architectures 


instr 1
instr 2
jump_equ
---------- Barrier
load
use


▲ Loads can cause exceptions
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Speculation 


EPIC Architectures 


ld.s              ; Exception Detection
instr 1
instr 2           ; Propagate Exception
jump_equ
chk.s             ; Exception Delivery (Home Block)
use

▲ Separate load behavior from exception behavior 

• Speculative load instruction (ld.s) initiates a load operation and
detects exceptions

• Propagate an exception “token” (stored with destination register)
from ld.s to chk.s

• Speculative check instruction (chk.s) delivers any exceptions
detected by ld.s
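The split between detecting and delivering an exception can be modeled in a few lines. In this sketch the token is a small Python object, memory is a dictionary, and `ld_s`, `chk_s`, and `guarded_read` are hypothetical names that only mimic the behavior described above.

```python
class DeferredException:
    """Stands in for the exception token stored with the destination register."""

    def __init__(self, reason):
        self.reason = reason


def ld_s(memory, address):
    """Speculative load: never raises, returns a token if the load fails."""
    if address not in memory:
        return DeferredException(f"invalid address {address!r}")
    return memory[address]


def chk_s(value, fixup):
    """Speculative check: runs the fixup code only if a token was propagated."""
    if isinstance(value, DeferredException):
        return fixup()
    return value


def guarded_read(memory, ptr):
    speculative = ld_s(memory, ptr)  # load hoisted above the branch
    if ptr is None:                  # the branch that used to guard the load
        return 0
    # Home block: any deferred exception is delivered here by the fixup,
    # which simply redoes the load non-speculatively.
    return chk_s(speculative, fixup=lambda: memory[ptr]) + 1


print(guarded_read({16: 41}, 16))
print(guarded_read({16: 41}, None))
```

When the pointer is null, the hoisted load quietly produces a token that is never checked, so the program behaves exactly as if the load had stayed below the branch.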




Speculation Minimizes the Effect of 
Memory Latency 


Traditional Architectures EPIC Architectures 



[Figure labels: Exception Detection; Propagate Exception; Exception Delivery]


▲ Give scheduling freedom to the compiler 

• Allows ld.s to be scheduled above branches

• chk.s remains in home block, branches to fixup code if an 
exception is propagated 
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Example: 8 Queens Loop 

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

Original Code

[Figure: the three conditions evaluated as successive branches, with True probabilities 38%, 72%, 47%]

13 cycles

3 potential mispredicts
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Example: 8 Queens Loop 

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

                         Original Code    Speculation
Cycles                   13               9
Potential mispredicts    3                3
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Example: 8 Queens Loop 

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true)) 

                         Speculation    Predication
Cycles                   9              7
Potential mispredicts    3              1
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Example: 8 Queens Loop 

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
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* RESULT: Almost half the required cycles are reduced 
and 2/3 of the potential mispredicts are eliminated. 
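For context, the condition in this example is the placement test from the classic 8 queens program. The sketch below is a generic Python version, not the code measured in the talk: it evaluates the test either with short-circuit `and` (three dependent branches) or with `&` (all three array reads issued up front and combined into one decision), and the array roles noted in the comments are the usual textbook ones.

```python
def count_solutions(branch_free):
    a = [True] * 15  # a[i+j]: one diagonal direction is free
    b = [True] * 8   # b[j]: row j is free
    c = [True] * 15  # c[i-j+7]: the other diagonal direction is free
    count = 0

    def try_column(i):
        nonlocal count
        for j in range(8):
            if branch_free:
                # All three loads happen unconditionally, one final test.
                ok = b[j] & a[i + j] & c[i - j + 7]
            else:
                # Short-circuit form: up to three separate branches.
                ok = b[j] and a[i + j] and c[i - j + 7]
            if ok:
                a[i + j] = b[j] = c[i - j + 7] = False
                if i < 7:
                    try_column(i + 1)
                else:
                    count += 1
                a[i + j] = b[j] = c[i - j + 7] = True

    try_column(0)
    return count


print(count_solutions(branch_free=False), count_solutions(branch_free=True))

# Figures quoted in the example: 13 -> 7 cycles, 3 -> 1 potential mispredicts.
cycle_reduction = (13 - 7) / 13
mispredict_reduction = (3 - 1) / 3
print(f"{cycle_reduction:.0%} fewer cycles, {mispredict_reduction:.0%} fewer mispredicts")
```

Both forms find the same solutions; the difference the slides point to is in how many branches the hardware has to predict along the way.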

Original Code: 13 cycles, 3 potential mispredicts

Speculation + Predication: 7 cycles, 1 potential mispredict
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EPIC is the Next Generation Technology 

Explicitly Parallel Instruction Computing

▲ Explicit parallelism

• ILP is explicit in machine code

• Compiler schedules across a wide scope

• Binary compatibility across all family members

▲ Features that enhance ILP

• Predication

• Speculation

• Others...

▲ Resources for parallel execution

• Many registers

• Many functional units

• Inherently scalable

[Chart labels: CISC, RISC]
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IA-64: EPIC Technology Applied 


▲Enables industry leading performance and capability 

• Explicitly parallel: Beyond the limitations of current architectures 

• Inherently scalable, massively resourced: Provides headroom for future 
market requirements 

• Fully compatible: For existing applications and the future 


▲Addresses server and workstation market 
requirements 

• Enterprise transaction processing 

• Decision support 

• Graphical imaging 

• Volume rendering 

• Many others 
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IA-32 Roadmap 


SPECint95 (log scale)


Higher frequencies via new process technologies 
Microarchitecture enhancements 




• CPU and platform architecture enhancements 

• Higher performance buses 

• New microarchitectures 
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-full-speed caches 

— larger cache sizes 
-higher bus frequencies 

— 4 processor direct connect 

Scalability beyond 4 proc. 
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IA-64 Roadmap 


SPECint95
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EPIC technology - beyond RISC 

Optimized for Servers & 
Workstations 



Future IA-64 Processor 
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Merced™ Processor - Program Update 

• Merced™ Processor 

- Code name for Intel’s first IA-64 processor 

- Industry leading performance and features for Servers and 
Workstations 

- Full IA-32 binary compatibility in hardware 

- Intel’s 0.18 micron process technology 

- Multiple product configurations to address specific segment needs 

• Complete solution stack at launch 

- Processor and chipset design teams on track for ‘99 production 

- OEM system design teams are making significant progress 

- Operating systems and key applications on track for launch 
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IA-64 Industry Commitment 
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Summary 

• IA-32 continues to offer high performance solutions for 
Servers, Workstations, Desktops and Mobile 


• IA-64 extends IA in high performance Servers and 
Workstations with full compatibility 


• Continuing major investments in both IA-32 and IA-64 
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Computer Architecture at HP 

• A summary of motivations, innovations 
and implementations in the period 1981 
to the present 

• Some observations about emerging 
systems requirements for the next 
decade and ongoing research to 
address them 
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A Capsule History 

The 80s: 

• RISC consolidation and evolution 

• Spectrum, HP Precision Architecture, and PA-RISC 

• Migration from HP 3000, 1000, M68000 architectures

The Early 90s:

• Beyond RISC: The quest for concurrency 

• Superscalar, VLIW, Wide Word 

• Compatibility with PA-RISC 

1994 - Present:

• The Intel alliance 

• Next generation technology and IA-64 

WHAT’S NEXT?...

JBTALKS/MPF1097/MPF1097.PPT


Precision Architecture Principles 

• Compiler does what it does best 

• Hardware does what it does best 

• As simple as possible, but not simpler 

• Measure / justify everything 

- Optimize for application throughput 

- Most work in least time 

• Architecture scales across family of 
implementations 

• Seamless migration essential 

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Technology Enablers of RISC 
Architecture 

• Progress in VLSI: Fast registers, cache 

• Globally optimizing compilers 

• Performance measurement and analysis 
tools 
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Challenges of RISC Architecture 

• Compiler accuracy and reliability 

• Migration of legacy code 
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Major PA-RISC Innovations 

• Compound instructions based on usage 
statistics 

• Instruction nullification 

• Legacy software migration 

- Binary code translation 

- Millicode 

- Migration centers 

• 64-bit addresses 

- 32-bit segments 

• Graphics and multimedia extensions 



Why a New Architecture? 

• Insatiable demand for more performance 

• Processor/memory speed gap implications 

- Higher bandwidth 

- More registers 

- Overlapped memory latencies 

• Need for greater number of instructions per 
cycle 

• Diminishing gains from growing 
microprocessor design complexity 
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Major Conclusion: High ILP Needs 
a New Architectural Approach 

• High ILP requires explicitly scheduled 
code 

- Scheduling by compiler 

• Architecture must expose parallelism 

• Scalability across implementations and 
applications required 
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and Microarchitecure 
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Need for a New Industry-Standard 
Architecture 

• HP alone would not succeed with a new 
proprietary architecture: economics, 
acceptance by ISVs 

• Technology alliance melds architecture/ 
design / fabrication excellence of Intel 
with architecture and systems 
excellence of HP 

• Opportunity for scalable common 
hardware platforms across operating 
systems 



The Rest of the System? 


Processor 

Memory 
components 


I/O 

devices 
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The Rest of the System 




CPU Performance (SPECint95)
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Changing Workloads 




Web Usage Model 

[Chart: x-axis # Clients (100 to 400); series: Static HTML, Dynamic HTML] 

System CPU, I/O Req. 

[Chart: 1997, 1998, 2000; series: 4-way CPU nodes, I/O bandwidth (GB/s)] 

• Increasing system loads 

• Increasing system 
bandwidth requirements 


Microprocessor Forum 


Page 8 


October 14-15,1997 


Architecture at HP: Two Decades of Innovation 


New Cost-Value Measures 



Web Predictability 


With HPLabs technology 
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■ Normal •••• With HPLabs technology 



[Chart legend: Normal, SSL encrypted] 

• Need median transfer 
predictability, security, 
reliability 
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New Control Points 


[Diagram: Processor-centric system (Processor, Memory components, I/O devices) versus Semi-autonomous Subsystems (Processing, System Control, Memory, Communication, I/O)] 
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New Control Points 

... in an open industry standard framework 
that permits system value adds 




Processing 


System Control 


Memory 



I/O 


Communication 


Intelligent I/O (I2O) 

e.g., IP routing, encryption, firewall, 
filtering, compression, high-speed back-up. 




The New Challenges 


Computing Components 



- Semi-autonomous subsystems 

- Integrated communication, 
memory, and computation 


Computing Systems 



- Distributed, heterogeneous 
systems of systems 


Computing Services 


(Information and Computation "Utilities") 

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Architecture at HP: Two Decades of Innovation 


The New Challenges 


Computing Components 


- Custom (embedded) 
processors 

- New control points 



Computing Systems 


- Information appliances 

- Utility servers 

Computing Services 
(Information and Computation "Utilities") 




The New Challenges 

Computing Components 

- Custom (embedded) processors 

- New control points 

- Semi-autonomous subsystems 

- Integrated communication, memory, and computation 

Computing Systems 

- Information appliances 

- Utility servers 

- Distributed, heterogeneous systems of systems 

Computing Services 
(Information and Computation “Utilities”) 

HP Labs is focused on the new challenges of tomorrow 
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The Evolving RISC Landscape 


The Evolving RISC 
Landscape 

Linley Gwennap 
Senior Analyst 
MDR 


Changes in the Past Year 

♦ HP, Digital contend for integer lead 

• Despite lagging IC technology 

♦ First 0.25-micron chips appeared 

• IBM, Motorola, TI lead the way 

• Need new core for best performance 

♦ Intel gained momentum, design 
wins at key RISC strongholds 

• Apple, Tandem, Silicon Graphics 
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What DIDN’T Happen 

♦ No new high-end cores shipped 

• PowerPC 750 shipped, but not high end 

• PA-8200 offers only minor changes 

♦ 21264 didn’t meet schedule 

• Tape-out delay puts shipments to 2Q98 

♦ Exponential didn’t survive 

• Bipolar hopes extinguished 


RISC System Consolidation 

♦ RISC market continues to grow 

• $51.7 billion in system sales in 1996* 

• 24% growth over previous year* 

♦ Few vendors use RISC other than 
processor developers 

• No NT on MIPS, PowerPC 

• Apple closes Mac clone market 

• Compaq moves Tandem from MIPS 

*Source: Andrew Allison 
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RISC Drives FP Performance 


[Bar chart: SPEC95 (base) floating-point (FP) and integer (Int) scores per processor] 

Vendor      HP       MIPS     Alpha   IBM    Sun       Intel 
Processor   PA8200   R10000   21164   P2SC   Ultra-2   Pent II 
MHz         236      250*     600     135    300       300 

(Source: SPEC except *vendor estimates) 


RISC Drives CPU Design 

♦ Dual-issue floating-point 

• Two independent 64-bit FP ops/cycle 

• Sometimes dual FP MAC (4 FP ops/cyc) 

♦ More on-chip cache 

♦ More memory bandwidth 

• GBytes/s versus Intel’s 528 Mbytes/s 

♦ Better MP scalability 
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What to Watch For in 1 998 

♦ Leading RISC processors deliver 
30+ SPECint95, 50+ SPECfp95 (base) 
• Twice Intel’s best performance in 1998 

♦ HP continues to vie with Digital for 
integer performance lead 

♦ Others will lag in performance 

♦ RISC processor designs focus on 
transaction and scientific servers 
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UltraSPARC - III 

A Scalable High Clock Rate SPARC Processor 


Gary Lauterbach 

Sun Microsystems, Inc. 

US-III Goals and Motivations 


Motivations: 

■ Memory sub-system is key to scalable performance 

■ Wire delays dominate cycle time 


Goals: 

■ Industry-leading memory and system bandwidths 

■ Performance scaling on existing binaries 


■ Industry-leading system scalability: 
1000+ processor systems 

■ Rapid performance scaling with future 
process technology 

■ Full Sparc V9 and Solaris compliant 
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Overview 


Physical Characteristics 

Process: 0.25µ CMOS, 6 layer metal 
Clock: 600+ MHz 
Die Size: 330 mm² 
Power: 70 W @ 1.8 volts 
Transistor Count: RAM 12 million, Logic 4 million 
Package: 1200 pin LGA 

Pipeline Characteristics 

Issue: 4 Integer, 2 Floating Point, 2 Graphics 
L1 Caches: 64KB 4-way Data, 32KB 4-way Instruction, 2KB 4-way Prefetch, 2KB 4-way Write 
L2 Cache: 1, 4, 8MB, 1-way; On-chip Tags; Off-chip Data 


US-III Floorplan 
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Pipeline Features 


■ Est. 35+ SPECint95, 60+ SPECfp95 
@ 600MHz (Base) 

■ 14 stage non-stalling pipeline 

■ Low latency, large, multi-set L1 data cache 

■ Extensive hardware/software prefetch support 

■ Fully integrated system and memory interfaces 

■ Code schedule compatible with 
UltraSPARC-I, II 

■ System software compatible with 
UltraSPARC-I, II 


Pipeline Diagram 


Front 
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32KB 4-way associative instruction cache 

— Fetch any 4 instructions in a 32 byte line 

— Virtual address micro-tags 

— 32 byte line size, 2 ns access time 

16K entry history-based branch predictor 

— modified gshare with 12 global history bits and 
14 branch PC bits, 2 bits per entry 

Branch mispredict cost 

— 7 cycles max 

— 3 cycles average mispredict taken 

8 entry jump target/return address stack 
20 entry instruction queue 
4 entry miss queue 

— Saves fall-through path from mispredicted taken branches 
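
The branch predictor bullets above can be illustrated with a minimal sketch. It assumes textbook gshare indexing (14 branch PC bits XORed with 12 bits of global history) and plain 2-bit saturating counters. The slide says "modified gshare" without describing the modification, so the class name `GsharePredictor`, the bit alignment of the XOR, and the update policy are illustrative assumptions, not the UltraSPARC-III circuit.

```python
class GsharePredictor:
    """Textbook gshare: 2-bit counters indexed by PC bits XOR global history."""

    def __init__(self, pc_bits=14, history_bits=12):
        self.pc_mask = (1 << pc_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0
        # 2**14 = 16K entries, each a 2-bit counter starting at weakly not-taken.
        self.counters = [1] * (1 << pc_bits)

    def _index(self, pc):
        # Assumption: drop the two low PC bits (4-byte instructions),
        # then fold the global history into the low index bits.
        return ((pc >> 2) & self.pc_mask) ^ self.history

    def predict(self, pc):
        """Return True when the branch is predicted taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the counter with the resolved outcome, then shift the history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def table_bits(pc_bits=14, bits_per_entry=2):
    """Storage for the counter table in bits."""
    return (1 << pc_bits) * bits_per_entry
```

Under these assumptions the table holds 16K entries of 2 bits each, which is 32 Kbit (4 KB) of predictor state.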




Instruction Fetch 

[Block diagram: Inst. TLB, 32 KB Inst Cache, Branch Predictor, 8 entry Jump Target/Return Address Stack, 20 entry Instruction Queue, 4 entry Miss Queue] 
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Integer Execute 


■ 4 integer issue from: 

— 2 Arithmetic/Logical/Shift 

— 2 Loads 

— 1 Store 

— 1 Branch 

■ Arithmetic/Special Unit (ASU) 

— 6-9 cycle integer multiply 

■ Pipelined predicated execution 

— 1 conditional move/cycle 

■ 7R3W port Register File 

— 1024 bit window change/restore 
— WRF - single window Working Register File 
— ARF - 8 window Architectural Register File 


Floating Point/Graphics Execute 


■ 2 instruction issue from: 

— FP/graphics add 
— FP/graphics mul/div/sqrt 

■ Fully pipelined 
FP/graphics add/mul 

■ Concurrent FP div/sqrt unit 

■ 5 Read, 4 Write port register file 


Instruction Latency 

Instruction         Latency 
Graphics Add        4 cycles 
Graphics Multiply   4 cycles 

Instruction         Single      Double 
FP Add              4 cycles    4 cycles 
FP Multiply         4 cycles    4 cycles 
FP Divide           17 cycles   20 cycles 
FP Square Root      24 cycles   24 cycles 
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Memory 
Subsystem 

■ 64KB 4-way associative L1 data cache 

— 32 byte line size 

— Virtual indexed, physically tagged 
— Write-through 

■ 2KB 4-way associative L1 
prefetch cache 

■ 2KB 4-way associative L1 write cache 


■ 1-8 MB direct-mapped L2 cache 
— ECC protected 

— 256 bit datapath, 12 cycle latency 

— 3 or 4 cycle pipelined access 

— Level 1 caches are non-inclusive 

— 90KB on-chip L2 Cache tags 




■ 8 entry RAW forwarding 
store queue 

■ On-chip DRAM controller 


Cache         Latency     Bandwidth 
L1 Data       2 cycles    9.6 GB/s 
L1 Prefetch   3 cycles    18.4 GB/s 
L1 Write      1 cycle     13.6 GB/s 
L2 External   12 cycles   6.4 GB/s 
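
As a rough consistency check on the bandwidth figures, the sketch below computes peak bandwidth as bytes per transfer times clock rate, divided by cycles per transfer. It assumes the 600 MHz clock from the overview slide, one 256-bit L2 transfer every 3 cycles (the faster of the "3 or 4 cycle pipelined access" options), and, for the L1 data cache, two 8-byte loads per cycle. That last figure is an assumption based on the two load units, not something the slides state, and the function name is illustrative.

```python
def peak_bandwidth_gb_s(bytes_per_transfer, clock_mhz, cycles_per_transfer=1):
    """Peak bandwidth in GB/s (10**9 bytes/s) for one transfer every N cycles."""
    return bytes_per_transfer * clock_mhz * 1e6 / cycles_per_transfer / 1e9


# L2: 256-bit (32-byte) datapath, one access every 3 cycles at 600 MHz.
l2_bandwidth = peak_bandwidth_gb_s(256 // 8, 600, cycles_per_transfer=3)

# L1 data: assumed two 8-byte loads per cycle at 600 MHz.
l1_data_bandwidth = peak_bandwidth_gb_s(2 * 8, 600)
```

Under these assumptions the L2 and L1 data results line up with the 6.4 GB/s and 9.6 GB/s entries. The prefetch and write cache figures are not reproduced here because the slides do not give their port widths.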




Data Cache 


■ Sum Addressed Memory(SAM) 

— Eliminates address adder 
from load path 

— 2 gate delays 

in address decode 
— No carry propagation 

— 1 load bubble 

■ Micro-Tags 

— 4-way associativity 
without access time penalty 

— 8-bit virtual address tag 
used to select way 

— Full physical address used 
to detect cache miss 
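
The idea behind sum-addressed memory can be sketched in software. Instead of adding base and offset and then decoding the sum, each cache row asks whether base + offset equals its own row number, and that question can be answered with bitwise logic only. The function below is the generic carry-free equality test usually used to explain the technique; it is an illustration under that assumption, not Sun's decoder, and the name `sum_equals` is hypothetical.

```python
def sum_equals(a, b, k, width):
    """Return True iff (a + b) mod 2**width == k, with no carry propagation."""
    mask = (1 << width) - 1
    # Carry-in each bit position would need for its sum bit to equal k's bit.
    required = (a ^ b ^ k) & mask
    # Carry-out each position would produce if its sum bit did equal k's bit,
    # shifted left so it lines up with the next position's carry-in.
    produced = (((a & b) | ((a | b) & ~k)) << 1) & mask
    # The sum matches k exactly when every required carry is the produced one.
    return required == produced


def decode_row(base, offset, rows_log2):
    """Pick the one row whose index equals base + offset (mod row count)."""
    hits = [k for k in range(1 << rows_log2)
            if sum_equals(base, offset, k, rows_log2)]
    return hits[0]
```

Each bit of `required` and `produced` depends only on the same or adjacent bit positions of the inputs, which is why a hardware version needs only a couple of gate levels per row rather than a full adder in front of the decoder.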


[Diagram: data cache sub-banks with micro-tag way select] 
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Prefetch Cache 

■ Concurrent fill and dual reads 

■ 8 outstanding software prefetches 

■ Autonomous hardware stride prefetch 

■ Fully coherent cache 







Multi-processor Scalability 

■ DRAM controller per processor 

— Fully programmable timing, interleave, etc. 

— 4 banks of DRAM, 4 GB max, 170 ns cycle miss latency 

— System memory bandwidth scales with 
number of processors 

■ Low memory latency 

— Minimum latency up to 4 processors 

— Small latency increase to 32 processors 

— Cache to cache transfers lower latency than memory 

■ Large Multi-processor scaling to 
1000+ processors 

— Minimum overhead, several hundred gates 
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UltraSPARC - III A Scalable High Clock Rate SPARC Processor 


External Interface 

■ Coherency BW scales with processor technology 
— On-chip snoop tags (50%BW of L2 on-chip tags) 

■ 9.6 GB/s coherency bandwidth 

— High frequency, single cycle request 
— Distributed request arbitration 

■ 2.4 GB/s off-chip bandwidth 

■ Up to 15 outstanding transactions 

— Tagged transactions, out-of-order completion 
— Full V9 total-store-order compliance 

■ Independent boot/diagnostic bus 


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Application Enhancements 


■ Media enhancements: 


— VIS extension - byte mask/shuffle, 
pipelined alignment 

■ Scientific enhancements: 

— Rounding mode support for interval arithmetic 

■ JAVA acceleration: 


— Jump target preparation 
— Coherent instruction cache 

■ Networking acceleration: 

— “No snoop” page attribute 

— Multiple outstanding Block stores - 
2.4GB/s block copy bandwidth 


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UltraSPARC-m 


Derivative Products 

— Frequency & performance increase 

— Cost engineered 

Architecture Extensions 

— Java 

— Media 

[Roadmap chart: i-Series, plotted over Time] 


Summary 

■ Spectacular performance with full compatibility 

— Full SPARC V9, Solaris and code schedule 
compatibility with US-I, II 

■ Scalable performance on existing binaries 

■ Industry leading scalability: 1000+ processors 

■ Industry leading system and 
memory bandwidths 

■ Performance scaling with future 
process technology 

■ Numerous application-specific 
performance enhancements 

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POWER3 



Next Generation 64-bit 
PowerPC Processor Design 


Mark Papermaster 

High-performance Processor Design 
IBM Microelectronics 
Austin, Texas 



Microprocessor Forum 
October 14, 1997 


Agenda 




□ POWER3 Project Goals 

□ Microprocessor Roadmap 

□ Key Product Specs 

□ Processor Overview 

□ System Implementation 

□ Verification 

□ Performance 

□ Status & Future Directions 
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Goals 


□ Product 

—Outstanding system-level performance for workstations, 
servers, supercomputers 

—Mission critical applications, i.e. mechanical CAD, data 
mining, seismic analysis, OLTP 

—Build on POWER2 strengths with PowerPC architecture, 
SMP scalability, 64-bit 

□ Processor 

—Bandwidth, memory system and dispatch capability to feed 
a highly superscalar execution core 

—Four floating-point operations per cycle 

□ Technology 

—Introduce in established technology; rapidly advance to 
CMOS 7S technology 


High-end Processor Roadmap 

P2SC 

> 135 MHz 
> 355 mm² 

P2SC+ 

> 160 MHz 
> 256 mm² 

> Floating point leadership design 

> Single-die Deep Blue processor 
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Key Product Specifications 



□ Size & power 

—15 million transistors 
—Die: 270 mm² 
—46W 

□ Technology 

—CMOS 6S2 
—0.25 µm hybrid lithography 
—Five levels of metal 
—2.5 volts 

□ Package 

—1088 pin ceramic package 
—748 signal I/O 

□ Clock distribution 

—<100 psec skew at latches 




Superscalar Processor 

[Block diagram: two Floating Point Units, three Fixed Point Units, two LD/ST Units, Branch/Dispatch; Memory Mgmt Unit with Instruction Cache and Memory Mgmt Unit with Data Cache, each on a 32-byte path; Bus Interface Unit (L2 Control, Clock) with a 32-byte path to the L2 Cache (1-16 MB) and a 16-byte path to the 6XX Bus] 
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Execution Core 



□ Three fixed point units 

—Two units implement single cycle operations 
-One unit for complex, multi-cycle instructions 




□ Two floating point units 
-Double precision data path 
-Three cycle latency, one cycle throughput 
— Each unit contains divide and square root sub-units 
—24 real and 32 virtual rename buffers 





□ Two load / store units 

— Each unit calculates one load or store / cycle 
-Loads processed speculatively 
-16 entry store queue 

□ Branch unit 

-2048 entry branch history table 
-128 x 2 entry branch target cache 
— Four pending predicted branches 




System Level Bandwidth 



6XX Bus: 16 Bytes, 1.6 GB/sec @ 100MHz 

Private L2 Bus: 32 Bytes, 6.4 GB/sec @ 200MHz 

L2 Cache: 1MB-16MB 

[Chart: systems compared: DEC 4000/4100 5/400; SGI Origin 200, 195MHz; POWER3 w/ASCI Node (est.)] 


Performance data as of 9/97, as reported in corporate websites and other public sources; except IBM data, which is estimated. 
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High Bandwidth: Instruction Cache 



32 Bytes - 8 Sequential Instructions 

[Diagram: I Cache built from two banks of 128 lines, 128 bytes per line; Cache Reload Buffer, 128 bytes; 32-byte path from the Bus Interface Unit] 


• 32KB instruction cache 
128-way associativity 
128-byte line size 
2-way interleaved 
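
The cache parameters above fix the cache geometry, which the short sketch below works out. The instruction cache numbers (32 KB, 128-way, 128-byte lines) come from the slide. The data cache line size is not stated on its slide, so the 128-byte value used for it is an assumption, and the helper name `cache_sets` is illustrative.

```python
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache: lines divided by ways."""
    lines = size_bytes // line_bytes
    if lines % ways != 0:
        raise ValueError("line count must be a multiple of the associativity")
    return lines // ways


# Instruction cache: 32 KB, 128-way, 128-byte lines (all from the slide).
icache_lines = 32 * 1024 // 128
icache_sets = cache_sets(32 * 1024, ways=128, line_bytes=128)

# Data cache: 64 KB, 128-way; the 128-byte line size is an assumption.
dcache_sets = cache_sets(64 * 1024, ways=128, line_bytes=128)
```

With these inputs the instruction cache has 256 lines in 2 sets, which is consistent with the two groups of 128 lines in the diagram. Under the assumed line size the data cache would have 4 sets.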






High Bandwidth: Data Cache 



[Diagram: 8-byte load unit data, 8-byte store data, 8-byte load unit data; 6XX Bus and private L2 Bus] 

• 64KB data cache 
128-way set associative 
8-way interleaved 
4-way by line 
2-way by double word 
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Decode-to-Completion Bandwidth IMm 

Four instructions per cycle completed 

Eight instruction execution 

Four instruction per cycle dispatch 

Dispatch Buffer: 4 instructions 

I Buffer: 12 instructions 




From Instruction Cache 




Reduced Latency Memory Subsystem 

[Diagram: L1 Instr. Cache and L1 Data Cache, each fed 32 bytes through a MUX; Instruction Prefetch Controls (from L2 only); Data Prefetch Controls (from L2 or 6XX Bus); Instruction Load Queue, 128 Byte; Data Load Queue, 128 Byte; Bus Interface Unit with 16-byte 6XX Bus and 32-byte Private L2 Bus] 


Non-blocking caches 

Four outstanding data L1 demand 
requests 

Two outstanding instruction L1 demand 
requests 

Dedicated hardware prefetch to L1, 
L2 and main memory 

I-side prefetch to predicted path and 
next line 

D-side prefetch to up to four streams, 
each stream prefetches next two lines 
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Verification 







□ Challenge 

-Verify a highly superscalar design in multiple system 
configurations 

□ Methodology 

—Simulation used over 400 POWER2-class RS/6000 systems 
—90 billion cycles targeted to cover 
Over 100,000 defined logic events 
Over 10,000 memory hierarchy bus scenarios 
—Thousands of checks executed on every cycle of testing 
—Majority of testing targeted to MP 
—Extensive array switch level simulation 

□ Results 

— Booted OS environment on first pass silicon 
-Achieved MP functionality on first pass silicon 


System Implementation 



□ Supports bus- and switch-based MP bus memory 
configurations 
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Performance 





[Chart: systems compared: DEC 4000/4100 5/400; HP Exemplar V-Class, 1 CPU; SGI Origin 200, 195MHz; POWER3 w/ASCI Node (est.)] 

[Chart: processors compared: DEC 21164; Sun UltraSPARC; HP PA8200; POWER3 (est.)] 







Performance data as of 9/97, as reported in corporate websites and other public sources; except IBM data, which is estimated. 



Rev the Engine . . . 

□ POWER3 processor in system bring-up phase 

—First pass silicon 1Q97 

—Initial design center 200+ MHz achieved 

□ Design focus to bring POWER3 to leading-edge technology 
and tune architecture 

—Implementation in CMOS 7S 
—0.20 micron technology 
—Copper metallurgy 

□ Design honed for higher clock speed and commercial 
application performance 

—Set associative L2 support 
—Fractional bus modes 

□ Next POWER3 processors in design phase 

—300+ MHz and 500+ MHz design targets 
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Summary 






□ Very robust design point... 

High bandwidth 
Highly superscalar 

...to match customer’s requirement for high 
computational capability 

□ Functionality on first pass silicon 

□ Two follow-on POWER3 design efforts underway 
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The HP PA-8500 RISC CPU 


High Speed SRAM 


with an 


Integrated CPU 

HEWLETT PACKARD 

Bill Queen 

Systems Technology Division 
Hewlett-Packard Company 


Presentation Overview 


• Cache SRAM Trends 


• Goals of the PA-8500 


• PA-8500 Processor Core 


• Improvements 


• Caches 


• System Bus 


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Cache Trends 

[Chart: Cache Size (log KB) versus Year] 


Goals for the PA-8500 

• Leadership application performance 

• Reduced system cost 

• Full performance without recompilation 
on PA-8x00 binaries 
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PA-8500 Strategy ^ 


• Boost frequency by porting to 0.25 micron 

• Design aggressive fast and large primary cache 

• Combine cache and CPU on one die 

• Increase bandwidth to main memory 

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PA-8500 Processor Core 


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§ 
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c 



Key Improvements to Processor Core 

• Larger 160-entry TLB 

• Larger 2048-entry BHT 

• Improved branch prediction (“agrees” mode) 
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PA-8500 Branch Prediction ^ 

Modes 

• Static 

Compiler-directed (Heuristics/PBO) 

• Dynamic 

Hardware-directed (“Taken/Not taken”) 

• “Agrees” 

Dynamic prediction using static hints 



Dynamic Prediction: Problem 

Code Stream    Static Hint 
Br 1           Taken 
Br 2           Not Taken 

Branch Outcome: CONFLICT (Both branches map to the same location) 

Conflict in the BHT causes poor branch prediction 
and degrades performance 
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mHEM PACKARD 
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Dynamic Prediction: 


Solution 

Code Stream    Static Hint    Branch Outcome    “Agrees” Outcome 
Br 1           Taken          T                 “Agrees” 
Br 2           Not Taken      NT                “Agrees” 

Accurate Prediction (Effective conflict elimination) 

“Agrees” Mode 
Improves performance by increasing branch prediction accuracy 
Effectively results in a larger number of BHT entries 
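
The conflict and its resolution can be reproduced with a small simulation. In this sketch two branches with opposite behavior share one 2-bit counter. In conventional mode the counter tracks taken versus not taken, while in agree mode it tracks whether the outcome matches the compiler's static hint. This illustrates the general agree-predictor idea only; the PA-8500's actual table organization and update rules are not given on the slides, and all names here are hypothetical.

```python
class TwoBitCounter:
    """Saturating 2-bit counter; values 2 and 3 count as 'high'."""

    def __init__(self, value=2):
        self.value = value

    def high(self):
        return self.value >= 2

    def train(self, up):
        self.value = min(3, self.value + 1) if up else max(0, self.value - 1)


def mispredicts(stream, agree_mode):
    """Count mispredictions when every branch aliases to one shared counter.

    stream is a list of (static_hint_taken, actual_taken) pairs.
    """
    counter = TwoBitCounter()
    misses = 0
    for hint_taken, taken in stream:
        if agree_mode:
            # High counter means "the outcome agrees with the static hint".
            predicted = hint_taken if counter.high() else not hint_taken
            counter.train(taken == hint_taken)
        else:
            # High counter means "taken", regardless of which branch it is.
            predicted = counter.high()
            counter.train(taken)
        misses += predicted != taken
    return misses


# Br 1 is hinted taken and is taken; Br 2 is hinted not taken and is not taken.
stream = [(True, True), (False, False)] * 50
conventional_misses = mispredicts(stream, agree_mode=False)
agree_misses = mispredicts(stream, agree_mode=True)
```

In conventional mode the two branches keep pulling the shared counter in opposite directions, so one of them is mispredicted on every pass. In agree mode both branches push the counter toward "agrees", so the aliasing becomes harmless.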


PA-8500 Caches 

Data Cache            1.0 MB 4-way set associative 
Instruction Cache     0.5 MB 4-way set associative 
Frequency             360+ MHz 
Latency               2-cycle pipelined access 
Bandwidth             2 accesses/cycle 
Cache-Line Support    32-byte, 64-byte 
Error Protection      Single-bit-correct ECC 
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PA-8500 



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The PA-8500 RISC CPU: High Speed SRAM with an Integrated CPU 


PA-8500 




Large Fast Caches 

Design Solutions 

• Careful composition (e.g. associativity) 

• Effective use of silicon resources 

• Cancel effects of clock skew 

• Special cache clocks 

• Special attention to the functional definition 

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The PA-8500 RISC CPU: High Speed SRAM with an Integrated CPU 


PA-8500 Caches 

• Unprecedented design: 

High bandwidth, low latency, and large size! 

• Balanced CPU execution bandwidth 

• Simplicity of solution 



PA-8500 System Bus 

• Higher bandwidth mode 

• 2X increase in bandwidth 

• 1 state for address + 2 data transfers per state 

• 64B line-size 

• Designed to enable higher bus frequencies 

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PA-8500: Meeting the Needs ^ 

of a Diverse Product Line 

• High-performance processor core 

• Integrated SRAM and CPU 

• Supports a variety of system configurations 

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A SPARC Microprocessor for High-End Servers 



A SPARC Microprocessor for 
High-End Servers 
- SPARC64™-III - 

/ 

Hisashige Ando 

HAL Computer Systems Inc. 

Microprocessor Forum 
October 14, 1997 
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HAL SPARC 64™ Roadmap 
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A SPARC Microprocessor for High-End Servers 


HALstation™ 385 


• Powered by SPARC64-II CPU 

• 161 MHz Clock 

• 9.0 SPECint95 (peak) 

• 16.0 SPECfp95 (peak) 

• Mainly used for Compute Server 

— Scientific and R&D 
-EDA 



HALstation™ 385 


NASPAR2.2 Class A Benchmark Results 



HALstation 385 (161 MHz) 
IBM P2SC (120MHz) 

SGI Origin 2000 
HP SPP2000 
CRAY T3e-900 (450MHz) 
Sun Ultra 4000 
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A SPARC Microprocessor for High-End Servers 



HALstation™ 385 


Calibre LVS Run time for SPARC64-III chip 



[Bar chart: run time (axis 0 to 25) on HALstation 385 versus K460; series: Compare, Extract] 
□ Extract 
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SPARC64™-III Key features 

• SPARC V9, 64bit, 4 issue superscalar, 
out-of-order execution processor 

• Full 64-bit virtual address support and large TLB 
for large memory applications 

• On-chip L2 cache controller supports 1M-16MB 
external L2 cache 

• UPA-compatible system bus interface 

• ECC & Parity protection for high reliability 
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SPARC64™-III Physical features 


• 17.6M transistors 

• 15.98 x 14.85 mm chip 

• 0.25um CMOS technology 

- Transistor: Leff = 0.18um, tox = 5.5nm 

- 5 layer Metal + LI 

- 0.9um/0.9um/0.9um/1.8um/2.7um pitch 

• Flip Chip 

• 957 pin Land Grid Array package 



SPARC64™-III Die photo 





SPARC64™-III Micro architecture 

• 4 instruction issue per cycle 

• Reservation Stations: 8 entries for integer, agen, 
and floating point execution units, 12 entries for 
ld/st address queue 

• 6 operation dispatch per cycle: 2 integer, 2 agen, 2 
floating point 

• Renaming Register file 

• Separate µTLB and large (unified) TLB 

• 2 bank, 2 access/cycle D-Cache 



SPARC64™-III Block Diagram 

[Block diagram: Fetch Unit with BHT, BRU, RPS, Prefetch buffer and 16KB I0-$; Issue Unit; Precise State Unit; µITLB (32 entries); I1-$ (64KB) and I1-$ Control; U2-Cache & UPA Interface to External U2-$ RAM and the UPA Bus; Rename with Floating point RF, Integer RF and Condition Code; Branch misprediction Handler; reservation stations for Fixed Point, AGEN/Fixed Pt, Floating Point and Load/Store; Integer ALU; Address Adder; FA, FMA, FDIV/SQRT; Load/Store Unit; Translation Unit; µDTLB (32 entries); D1-$ (64KB) and D1-$ Control] 
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SPARC64 ™-III Pipeline 



SPARC64™-III Fetch and Issue unit 



ILT (instruction link table): 
Predict next fetch PC 

BHT (branch history table): 
Supports both One and Two 
level adaptive / 2-bit counter 

RPS (return prediction 
stack) 

I0-Cache: level 0 recoded 
instruction cache 
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SPARC64™-III Renaming Reg file 


[Diagram labels: reg#, SN] 

• integer: 

- 5 windows 

- 128 words 

- 10 read / 4 write 

- 16 check points 

• floating point: 

- 64 double precision or 128 
single precision entries 

- 6 read / 4 write 

- 16 check points 




SPARC64™-III Floating Point execution unit 

• multiply and add 

- 4 cycle latency 

- 1 cycle pitch pipelined 

• add 

- 3 cycle latency 

- 1 cycle pitch pipelined 

• divide / square root 

- 12 cycle (single) 
21-22 cycle (double) 
latency 
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A SPARC Microprocessor for High-End Servers 


SPARC64™-III Load/Store Unit 


[Diagram: ld/st address queue (12 entries); find oldest & 2nd oldest out-of-order candidates; select depends on memory model] 

• 2 load/store operations per cycle 

• support 3 memory models 

- LSO: load store ordering 

- TSO: total store ordering 

- STO: store ordering 





SPARC64™-III Cache hierarchy 

Instruction buffer: 16 byte/cycle 
I0-$: 16KB, direct map, VIVT 
I1-$: 64KB, 4way, VIPT; 16 byte/cycle 
Load/store unit: 8 byte/cycle x 2 
D1-$: 64KB, 4way, 2bank, VIPT, writeback; 16 byte/cycle 
U2-$: 1~16MB (8 SRAM chips), direct map, PIPT, write back; 16 byte/UPA-cycle 

• I0-$: recoded instruction cache 

• On chip 64KB + 64KB separate cache 

• Off chip large unified cache support 
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SPARC64™-III 

Address Translation Hierarchy 




[Diagram: VA goes to the I-µTLB (32 entries, fully associative) and D-µTLB (32 entries, fully associative); VA on miss goes to the Main TLB (256 entries, fully associative), which returns the PTE; the Main TLB is backed by software] 

• variable page sizes (16): 
4KB to 4GB 

• OS and DB SGA can be 
mapped with few giant 
pages 

• 2 level TLB: fast access 
and large capacity 


SPARC64™-III Summary 

• Carries HAL SPARC64™ architecture into future 

- Full 64bit architecture and implementation 

- >50% faster than current SPARC64™-II 

• For computational server and database server 

- Large Memory Support 

- High reliability with ECC & parity 

• Single Chip CPU 

- 1-16MB off chip L2 cache support 

- integrated UPA interface 
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DVx MPEG-2 Video Codec 


DVx MPEG-2 Video Codec 



Microsystems 


Fulfilling the Promise of Digital Video 



Les Kohn 

C-Cube Microsystems 


MPEG Encode Application Roadmap 

[Roadmap chart; time axis through 2000] 
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DVx MPEG-2 Video Codec 


DVx Objectives 



▼ Enable MPEG-2 content creation on PC 

▼ Low delay simultaneous encode and decode for video 
communication 

▼ Professional profile for studio applications 

▼ Scaleable to HDTV 

▼ State-of-the-art quality 


DVx: The Path to Personal Publishing 

[Diagram: CD/R or HD and VHS (incompatible recording & playback formats) versus recordable DVD + DVx, MPEG, DVD Player (compatible disc-based PC recording and DVD playback)] 
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DVx MPEG-2 Video Codec 


DVx Enables MPEG-2 Video Editing 

▼ Dual Stream Decode with Effects 

✓ Frame Accurate Non Linear Editing 

▼ Advantages of MPEG-2 

✓ Compatibility with DVD & DVB 

✓ Storage and Bandwidth Savings 

♦ JPEG = 30-50 Mbps 

♦ MPEG-2 = 4-18 Mbps 

[Diagram: JPEG Editing Suite; Images, Editing and SFX Control, Edited Image] 


Dual Stream Editing Example 


[Diagram: Decode stream A (decode order) and Decode stream B (decode order) combine through a Transition into the Output Stream (display order)] 
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DVx MPEG-2 Video Codec 


Design Concepts 





▼ Standard RISC engine for High Level Operations 

✓ Flexibility to support different applications and add features 

✓ Ease of programming and good tools 

✓ Quick bug fixing 

▼ CISC Coprocessors for Pixel Operations 

✓ Optimized hardware for performance critical functions 

✓ Complex instructions minimize RISC core demands 

▼ Design for Worst Case, not Average Performance 

✓ Avoid cache, main memory and VLC bottlenecks 


RISC Engine 


▼ MicroSPARC 32-bit RISC 

✓ Single Scalar 

✓ Fast simulation w/Sparc WS 

▼ 16K byte Instruction Cache 

✓ No misses inside of loops 

▼ 8K byte Data Memory 

✓ Software Managed 

✓ Overlapped DMA transfers 

✓ Predictable performance, unlike Cache 

♦ Cache misses would be unavoidable inside loops 


[Block diagram: Audio/Video, ME, DSP, RISC, PCI, IC, TMEM, DMEM, I Cache, SDRAM, WMEM, RMEM] 
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DSP Coprocessor 



▼ Vector M->M instructions 

✓ Sparc Coprocessor Instructions 

✓ Code density and low issue rate 

▼ 8Kx8 Data Memory 

✓ Double buffered to allow 
concurrent DMA and DSP 

▼ DSP Functions 

✓ DeTelecine 

✓ Activity Measures 

✓ Motion Compensation 

✓ Adaptive Temporal Filter 

✓ Linear Filter / Decimation 

✓ DCT/IDCT 

✓ Quantization / Dequantization 

✓ Variable Length Coding / Decoding 


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DSP Instruction Execution 


[Block diagram labels: External Memory Interface; Instruction Cache; RISC Core; DMA Instruction Queue; DSP DMA; DSP Instruction Queue; VLCode & Decode; Filter & DCT; DMEM; Return Value Queue]
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DSP Synchronization 





Instructions: 

✓ Swapdsp 

✓ Syncdsp 

✓ Syncrmem 

To Process N blocks of data: 

✓ Do for N + 2 iterations:

♦ If not last two iterations perform DMA DSP loads 

♦ If not first or last iteration perform DSP operations 

♦ If not first two iterations perform DSP stores 

♦ swapdsp 
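
The loop above is a three-stage software pipeline: load, compute, and store overlap on different blocks. A minimal sketch of that schedule follows; the function name `pipeline_schedule` and the tuple layout are assumptions for illustration, and the real `swapdsp`, `syncdsp`, and `syncrmem` instructions are not modeled.

```python
def pipeline_schedule(n_blocks):
    """Return, for each iteration, the block index that is loaded, processed,
    and stored (None when that stage is idle)."""
    total = n_blocks + 2
    schedule = []
    for i in range(total):
        # "If not last two iterations perform DMA DSP loads"
        load = i if i < total - 2 else None
        # "If not first or last iteration perform DSP operations"
        compute = i - 1 if 0 < i < total - 1 else None
        # "If not first two iterations perform DSP stores"
        store = i - 2 if i >= 2 else None
        schedule.append((load, compute, store))
        # swapdsp would exchange the double-buffered data memories here
    return schedule

for step, stages in enumerate(pipeline_schedule(3)):
    print(step, stages)
```

Each block is loaded in iteration i, processed in iteration i + 1, and stored in iteration i + 2, which is why N blocks need N + 2 iterations.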




ME Coprocessor 



▼ Architecture

✓ Double-buffered Target and
Reference Memories

✓ Reuse data between targets 

✓ 64 abs diff/cycle throughput 


[Block diagram labels: Audio/Video; PCI; ME; DSP; RISC; IC; TMEM; I Cache; DMEM; SDRAM; WMEM; RMEM]


▼ Programmable ME Algorithm with Minimal RISC Core
Overhead

✓ RISC CPU writes ME search command for each target in SDRAM

✓ ME writes search result list in SDRAM

✓ Single interrupt generated at end of each ME stage 
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ME Command Processing 

▼ ME Pipeline 



[Pipeline diagram labels: Command Fetch; Target and Reference Fetch; Search; Result Store; commands, target and reference data, and results are held in SDRAM]


▼ ME Command and Result Format 


[Format diagram fields: Vx; Vy; Score (result); Parameters (command); single-letter flags F, L, D, A, D, V]
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Inter-chip Communication 





▼ Scaleable to HDTV 

✓ Image reference data transfers 

✓ Master/slave control 



▼ Independent Receive /Transmit Channels (2 each) 

▼ Simple Protocol 

✓ Packet based 

✓ Routing to appropriate chip based on address field in header 
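
A minimal sketch of header-based routing, assuming chips are connected in a simple chain and each packet header carries a single destination address. The packet layout, the chain topology, and the names `Packet`, `route`, and `deliver` are all hypothetical; the slide only says that packets are routed by an address field in the header.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    dest: int        # address field in the header (hypothetical layout)
    payload: bytes   # e.g. image reference data or a control message

def route(packet: Packet, chip_id: int) -> str:
    """A chip consumes packets addressed to it and forwards all others."""
    return "consume" if packet.dest == chip_id else "forward"

def deliver(packet: Packet, chain: list) -> object:
    """Pass the packet along a chain of chip IDs and return the number of
    forwarding hops before it is consumed, or None if no chip accepts it."""
    for hops, chip_id in enumerate(chain):
        if route(packet, chip_id) == "consume":
            return hops
    return None

print(deliver(Packet(dest=2, payload=b"reference rows"), [0, 1, 2, 3]))
```

Keeping the routing decision to a single comparison against the header is what lets the protocol stay simple as more chips are added.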
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DVx System Configuration 


Real-Time A/V Input/Output

• Glueless Interfaces

bridge or external FIFO

[System diagram labels: Audio I/O; Audio Codec; NTSC/PAL Encoder; NTSC/PAL Decoder; DVx with Audio/Video, SDRAM, and PCI ports; PCI bus; bus widths of 64-bit and 32-bit]
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DVx Statistics 


▼ 5.4 Million Transistors 

▼ 100-MHz 

▼ 162 sq mm 

▼ 0.35 µm 4LM CMOS

▼ 4.7W @ 3.3V (encode) 

▼ 352 pin BGA 

▼ Sampling now, production 12/97 



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DVx MPEG-2 Video Codec 


DVx Die Photo 



[Die photo labels: SDRAM I/F; Audio I/F; PCI; Video I/F; Sparc; ME]
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Summary 


▼ Cost Effective, Multi-Application Codec 

▼ Flexible: Programmable and Scaleable 
▼ Professional

- Enables migration to HDTV, ATV 

- Video Collaboration 

▼ Prosumer - DVx + Recordable DVD 

- Complete MPEG-2 content creation solution

- Broadcast quality at entry level price 


DVx: Fulfilling the Digital Video Promise

[Diagram labels: HDTV, ATV; Video Collaboration; DVD Publishing]
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Processing 
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ManArray Technology: The Scalable Future of Signal Processing 


Billions of Operations Per Second 







BOPS 


INCORPORATED 

...a chipless chip company. 


Presents for the first time 

ManArray™ Technology 

The Scalable Future of 
Signal Processing 

by Gerald G. Pechanek, CTO. 


Representing the 
ManArray Design Team 
Chapel Hill, North Carolina. 


ManArray™


A scalable architecture for a family of high- 
performance single-chip parallel processors 


Based on a compute array of Processing 
Elements (PEs) and controller Sequence 
Processors (SPs) 

Billion-operations-per-second performance in 
3D graphics, video compression, audio, and 
other compute-intensive tasks 
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BOPS 




INCORPORATED 


The ManArray™ Technology Message 





ManArray™ Delivers Performance 

Optimized scalar, packed data, eVLIW, Vector, SIMD, & Multiple-SIMD 
Single cycle PE-to-PE communications through novel ManArray network 

ManArray™ Scales to Multiple Products 

Family of high performance cores: 1x1, 1x2, 2x2, 2x4, 4x4, 4x4x4 
Novel interconnection network between PEs and clusters of PEs 
Software investment scales across core family 






ManArray™ Is Easy & Fun to Program



Open instruction set architecture & tool set across BOPS products 
Selectable parallel programming models: DSP, eVLIW, vector, SIMD 


ManArray™ Is Low-Cost & License-able

Novel ManArray topology & eVLIW design reduces chip wiring 
Regular layout lowers cost and increases performance 
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INCORPORATED 



2x2 Cluster Building Block: 


Logic Core: 

Processor Elements (PEs) 
Cluster Switch (CS) 
Sequence Processor (SP) 

Application-sized Memory:

PE Local Data Memory 
SP Data Memory 
SP Instruction Memory 
VLIW Memory (VIM) 

Buses & Control: 

bandwidth matched to topology 

Instruction Bus (I) 

Data Bus (D) 

DMA Bus (M) 

Control & I/O Bus 
DMA Controller 
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| INCORPORATED 



2x2 Cluster Building Block: 


What are PEs?
What is VIM and
how does it work?
What are SPs?


Why a 2x2 Cluster? 
How can you scale to 
larger & smaller 
arrays? 





WWW.BOPS.COM 




Inside a PE Node 



[PE node diagram labels: 32-bit data buses; 32-bit instruction bus; VIM address; VLIW Memory (VIM) with a Load slot; eVLIW bypass for simplex SIMD operations; Partitioned Execution Units (packed data); Send and Receive ports to/from the Cluster Switch]
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BOPS 




eVLIW Setup 



Delimiter Instruction (DLM VIM1, 5)







The Delimiter Instruction sets the VIM address pointer. 

An eVLIW is an encapsulated set of 32-bit instructions 
located at a specified VIM address. 



eVLIW Flow 


Load Rt, PEMem 



DLM VIM1, 5
Load Rt,PEMem
Mpy Rx,Rt,Rs
Add Rz,Rx,Ry
Com Rs,East,Rz
Store PEMem,Rz

VIM address 1

[Diagram: VLIW Memory (VIM) with the Load slot filled]

eVLIW i

eVLIW 1


Parallel Decode & Execute Control 
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eVLIW Flow 


Mpy Rx, Rt, Rs 


DLM VIM1, 5
Load Rt,PEMem
Mpy Rx,Rt,Rs
Add Rz,Rx,Ry
Com Rs,East,Rz
Store PEMem,Rz

VIM address 1

[Diagram: VLIW Memory (VIM) with the Load and Mpy slots filled]

eVLIW i

eVLIW 1
eVLIW 0


Parallel Decode & Execute Control 
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WmmlM 







eVLIW Flow 


Add Rz, Rx, Ry 


DLM VIM1, 5
Load Rt,PEMem
Mpy Rx,Rt,Rs
Add Rz,Rx,Ry
Com Rs,East,Rz
Store PEMem,Rz
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VIM address 1

[Diagram: VLIW Memory (VIM) with the Load, Mpy, and Add slots filled]
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eVLIW i 


eVLIW 1 
eVLIW 0 


Parallel Decode & Execute Control 
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eVLIW Flow 


Com Rs, East, Rz 


DLM VIM1, 5
Load Rt,PEMem
Mpy Rx,Rt,Rs
Add Rz,Rx,Ry
Com Rs,East,Rz
Store PEMem,Rz
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VIM address 1

[Diagram: VLIW Memory (VIM) with the Load, Mpy, Add, and Com slots filled]

eVLIW i


eVLIW 1 
eVLIW 0 


Parallel Decode & Execute Control 
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eVLIW Flow 


Store PEMem, Rz 


DLM VIM1, 5
Load Rt,PEMem
Mpy Rx,Rt,Rs
Add Rz,Rx,Ry
Com Rs,East,Rz
Store PEMem,Rz

VIM address 1

[Diagram: VLIW Memory (VIM) with the Load, Mpy, Add, Com, and Store slots filled]

eVLIW i


eVLIW 1 
eVLIW 0 


Parallel Decode & Execute Control 
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ManArray ™ IB jl 




Execute eVLIW (XV) 


[PE diagram labels: Local PE Memory (>1 kbytes); Memory Switch; Load & Store; 32x32 Register File; XV Instruction; VLIW Memory (VIM) with Load, Mpy, Add, Com, and Store slots; Parallel Decode & Execute Control; MAU, ALU, and DSU units; Send and Receive ports to/from the Cluster Switch]

The XV instruction is received in the PEs.
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Execute eVLIW (XV) 


[PE diagram labels: Local PE Memory (>1 kbytes); Memory Switch; Load & Store; 32x32 Register File; XV Instruction; VLIW Memory (VIM)]
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■pv 

Load | Mpy 
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f > 
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Send 



Receive 


To/from 


The eVLIW-1 is read from VIM and decoded. 
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Execute eVLIW (XV) 



Local PE
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Efeil: 


XV Instruction: XV, Execute eVLIW-1

VIM address 1
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y (vii 

[Diagram labels: VLIW Memory (VIM) slots Load, Mpy, Add, Com, and Store; Parallel Decode & Execute Control; Send and Receive ports to/from the Cluster Switch]


The XV eVLIW-1 is executed. 
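
The setup-and-execute sequence on these slides can be sketched as a toy model. Everything here is illustrative: the class `ToyPE`, its method names, and the rule that an instruction's slot is chosen by its opcode are assumptions, and the model says nothing about timing or about whether instructions also execute while they are being loaded.

```python
SLOTS = ("Load", "Mpy", "Add", "Com", "Store")  # slot order shown on the slides

class ToyPE:
    """Toy model of one PE's VLIW Memory (VIM); not the real hardware."""

    def __init__(self):
        self.vim = {}        # VIM address -> {slot name: instruction text}
        self.pointer = None  # VIM address selected by the last DLM
        self.pending = 0     # instructions still expected after the DLM

    def dlm(self, address, count):
        """Delimiter: set the VIM address pointer and expect `count` instructions."""
        self.pointer = address
        self.pending = count
        self.vim[address] = {}

    def issue(self, instruction):
        """Place an instruction that follows a DLM into the slot named by its opcode."""
        if self.pending == 0:
            raise RuntimeError("no DLM is active")
        opcode = instruction.split()[0]
        if opcode not in SLOTS:
            raise ValueError(f"unknown slot: {opcode}")
        self.vim[self.pointer][opcode] = instruction
        self.pending -= 1

    def xv(self, address):
        """Execute eVLIW: return every instruction held at `address`, in slot order."""
        word = self.vim[address]
        return [word[slot] for slot in SLOTS if slot in word]

pe = ToyPE()
pe.dlm(1, 5)  # DLM VIM1, 5
for text in ("Load Rt,PEMem", "Mpy Rx,Rt,Rs", "Add Rz,Rx,Ry",
             "Com Rs,East,Rz", "Store PEMem,Rz"):
    pe.issue(text)
issued_together = pe.xv(1)  # XV: all five instructions issue as one eVLIW
print(issued_together)
```

The point of the model is the division of labor: the five 32-bit instructions are sent once, and afterwards a single XV instruction is enough to issue all of them together.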







Sequence Processor (SP) 


Each SP contains... 

...the same functional units 
found in the PEs, 

...Instruction & Data 
Address Generation Units, 

...Instruction & Data 
Memories, 

...PE Array Interface 
Control & I/O Bus Interface. 
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2x2 Cluster Building Block 


What are PEs? 
What is VIM? And 
how does it work? 
What are SPs? 
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A 4x4 Torus


Standard PE 


North 


West 



East 


South 
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Rotate columns 
1, 2, & 3 down 
one position. 





...transformed... 


Torus after the 
columns 1, 2, & 3 
have been rotated 
down one position. 
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scalable ... 


Rotate column 
3 down 
one position. 
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Row Cluster Grouping




(Enhanced Torus) 


North & West 
ports cross Cluster 
boundary here. 


South & East 
ports cross Cluster 
boundary here. 

Each row is a 
2x2 Cluster with 
transpose PEs 
together. 



To cut wiring in 1/2, combine the N/W & S/E links in SIMD! 
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Creation of 4x4 ManArray™ 

Grouping the Clusters for a Regular Layout 


Rearrange the 4-PE 
Row Clusters. 

Use the shared links 
between the Clusters. 

This rearrangement 
provides a regular layout 
for physical design. 


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Creation of 4x4 ManArray 

Enhancing algorithm performance 


Improve the connectivity of 
the array. 


The Cluster Switch (CS) 
completely connects the 
PEs within the 4 Clusters. 


The connected Clusters 
provide algorithmic 
performance advantages 



Four 2x2 Clusters 
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a Quantum-Leap Beyond the Torus. 
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4D Hypercube 

Definition 

A d-dimensional hypercube
has 2^d nodes.

A PE node ID is a 
d-bit binary string. 

PEs are directly connected 
if & only if their IDs differ 
by 1-bit. 
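
The definition translates directly into bit operations. In the sketch below, node IDs are plain integers and the function names are chosen for this example; the far node used in the demonstration is simply the bitwise complement of 0001, picked because a complement always lies d hops away.

```python
def neighbors(node: int, d: int) -> list:
    """IDs directly connected to `node`: flip each of the d bits in turn."""
    return [node ^ (1 << bit) for bit in range(d)]

def distance(a: int, b: int) -> int:
    """Hop count between two nodes: the number of differing ID bits."""
    return bin(a ^ b).count("1")

D = 4
print(2 ** D)                                      # number of nodes
print([f"{n:04b}" for n in neighbors(0b0001, D)])  # neighbors of PE-0001
print(distance(0b0001, 0b1110))                    # complement is d hops away
```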

Hypercube long path 

A d=4 hop long path 
is shown between 


PE-0001 & PE- 






Creation of 4x4 ManArray™ 

Hypercube embedding: 



Embed the hypercube by 
encoding the (i,j) PE nodes

with Gray Codes. 


Note that each cluster 
contains the longest 
hypercube paths. 

In ManArray, the furthest 
apart hypercube nodes 
become Cluster neighbors 

and the network diameter is 
cut in half. 
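
One way to picture the embedding is to use the standard reflected binary Gray code for the row and column indices. The slide does not give the exact bit assignment, so placing the row code in the high bits and the column code in the low bits is an assumption, as are the function names.

```python
def gray(n: int) -> int:
    """Reflected binary Gray code of n."""
    return n ^ (n >> 1)

def node_id(i: int, j: int, bits: int = 2) -> int:
    """Hypercube ID for the PE at row i, column j of a 4x4 array (assumed layout)."""
    return (gray(i) << bits) | gray(j)

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two IDs differ."""
    return bin(a ^ b).count("1")

SIZE = 4
for i in range(SIZE):
    print(" ".join(f"{node_id(i, j):04b}" for j in range(SIZE)))

# Every torus neighbor, including the wrap-around links, differs in exactly one bit.
torus_ok = all(
    hamming(node_id(i, j), node_id(i, (j + 1) % SIZE)) == 1
    and hamming(node_id(i, j), node_id((i + 1) % SIZE, j)) == 1
    for i in range(SIZE)
    for j in range(SIZE)
)
print(torus_ok)
```

Because adjacent Gray codes differ in one bit, every torus link is also a hypercube link, which is what allows the 4x4 array to carry a 4D hypercube.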
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4D Hypercube = Four 2x2 Clusters 


ManArray advantages 

• For k-dimensional hypercubes,
nodes at distance k move to
distance 1 on the ManArray

• All ManArray distances
are ≤ ⌈k/2⌉


Example applications 
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All-to-All Communications
Perfect Shuffle 
Matrix Transposition 
Matrix Multiplication 
DCT/IDCT 
Stockham FFT 
Batcher Sort 




Physical Attributes 

Single-cycle register-to-register transfers between PEs: 

• Within Clusters and between orthogonal Clusters 
N,E,S,W, Hypercube, Transpose, HyperComplement, 

Enhanced connectivity with 1/2 wires between Clusters 

PEs are completely connected in the Clusters 

Only 2 ports required per PE independent of the topology 

PEs decoupled from the topology 

No global wrap-around wires 

Regular layout for physical design 

Built-in scalability: 1x1, 1x2, 2x2, 2x4, 4x4, 4x4x4 cores 
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Programming Models for the ManArray™ 


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*y»v*v«v*v*v«v. <v. ty«'A * 
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I 


[Figure: programming-model chart; only fragments of its labels survive, including "PE select" and "Vectors"]





Programmers increase their code density as they move from
traditional programming models towards fully optimized eVLIW code

Programmers can select any of these models to run independently 

Optimized library of algorithms targeting 3D graphics, MPEG 
encoding, audio, communications, signal processing ... 
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programmable 






One Example Appropriate for Many Applications 



(4x4 Matrix) x (4x4 Matrix) Multiplication (16-bit data)

144 Operations per matrix multiply (64 mpys + 48 adds + load16/store16)

96 Cycles; 1 BOPS Sequence Processor (SP) using multiply-adds
72 Cycles; 1 BOPS SP using dual 16-bit packed data formats
32 Cycles; 1 BOPS SP using dual 16-bit packed data & eVLIW

8 Cycles; 2x2 array using dual 16-bit packed data & eVLIWs 

2 Cycles; 4x4 array using dual 16-bit packed data & eVLIWs 
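
The operation count and the relative speedups can be checked with a few lines of arithmetic. The cycle counts are copied from the slide; treating the 32 operations left over after multiplies and adds as the load/store work is an inference from the stated total of 144, and the dictionary keys are labels made up for this example.

```python
N = 4
multiplies = N * N * N          # one multiply per (row, column, term): 64
adds = N * N * (N - 1)          # three adds per output element: 48
stated_total = 144              # operations per matrix multiply, from the slide
load_store = stated_total - multiplies - adds  # remainder: 32

cycles = {
    "SP, multiply-adds": 96,
    "SP, packed data": 72,
    "SP, packed data + eVLIW": 32,
    "2x2 array": 8,
    "4x4 array": 2,
}
baseline = cycles["SP, multiply-adds"]
speedup = {name: baseline / count for name, count in cycles.items()}

for name, factor in speedup.items():
    print(f"{name}: {cycles[name]} cycles, {factor:.1f}x")
```

On these figures the 4x4 array finishes the multiply 48 times faster than the multiply-add baseline on a single SP.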
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2x2 Kitty Hawk 

(our first to fly)

Cluster Core Estimates

Cluster Gates:   181K
SRAM:            16 KBytes
0.25 µm:*        13.8 mm²
0.35 µm:*        26.9 mm²
Avg. Power:**    1.3
Peak bops:***    12.8

* Overall size includes the memory & DMA
** Anticipated use @ 100 MHz and 3.3 V.
*** 8-bit performance numbers
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m 
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lost, 

[Block diagram labels: Mem; Control; AGP/PCI; Customizable Peripherals; I/O; Control & I/O]





Microprocessor Forum


October 14-15, 1997 








Cluster Core Estimates

Cluster Gates:   107K
SRAM:            14 KBytes
0.25 µm:*        9.2 mm²
0.35 µm:*        18.1 mm²
Avg. Power:**    0.8
Peak bops:***    6.4

* Overall size includes the memory & DMA
** Anticipated use @ 100 MHz and 3.3 V.
*** 8-bit performance numbers
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2x4 Core 

(Two 2x2’s) 




Cluster Core Estimates

Cluster Gates:   363K
SRAM:            [value illegible] KBytes
0.25 µm:*        27.7 mm²
0.35 µm:*        53.8 mm²
Avg. Power:**    2.5
Peak bops:***    25.6

* Overall size includes the memory & DMA
** Anticipated use @ 100 MHz and 3.3 V.
*** 8-bit performance numbers
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ManArray™Technology: 

The Scalable Future of Signal Processing 


Announcing: 


• Now Licensing the BOPS family of cores to a limited number of 
partners. 


• The Kitty Hawk, a 2x2 synthesizable core with development 
tools, will be available in 1H98.* 


• See us for a sneak preview of our development tools. 


* Subject to [remainder illegible]
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Consulting Services 


Consulting Staff 

Michael Slater 

Founder and Principal 
Analyst, MDR and 
Microprocessor Report 

Comprehensive reviews of 
microprocessor trends and 
future perspectives. 



Linley Gwennap 

Publisher and 
Editor-in-Chief, 
Microprocessor Report 

In-depth critical analysis of
CISC and RISC microprocessor
architectures, semiconductor
manufacturing costs, and
Intel market strategies.

Peter N. Glaskowsky 

Senior PC Analyst, 
Microprocessor Report 

PC system design and
components, especially
in the areas of memory,
chipsets, networking,
and the Internet.




James L. Turley 

Senior Analyst, 
Microprocessor Report 

Specializing in high- 
performance embedded 
microprocessors, especially 
in the areas of consumer 
electronics & industrial 
applications. 



The Industry's Leading Analysts Bring 
Their Perspective to You 

MDR (MicroDesign Resources) offers a full range of consulting 
services to firms trying to find winning technology strategies 
in both the microprocessor and personal computer industries. 
Our analysts can: 

• Assist in identifying your best opportunities for growth 

• Evaluate new technologies you are developing 
or adopting 

• Identify and capitalize on specific technologies that 
hold promise 

• Summarize your opportunities and challenges in 
the market 

Technology Consulting from Experts 

MDR’s consulting practice is based on providing customized, 
detailed answers to your most pressing technology development 
and adoption problems. 

We provide insight tempered by our years of watching microprocessor
and PC technology evolve as editors of Microprocessor
Report.

Consulting projects are accepted selectively, where MDR’s
analysts’ unique strengths can be best leveraged for your benefit.
Consulting is arranged at either daily or project rates.

Scheduling 

Simply contact MDR at 707.824.4001, 800.527.0288, or by email 
at cs@mdr.zd.com. We’ll give you complete information on 
availability and costs. To find out more about our consulting 
services staff, visit our website at http://www.MDRonline.com 


Mel Thomsen 

Director, 

MDR Consulting Services 

Executive perspective on 
semiconductor industry 
trends and market dynamics. 
DRAM technology and 
market trends. 


Peter Song 

Senior Analyst, 
MDR 

Analysis of high 
performance x86 
microprocessors 





Peter Christy 

President, 
MDR 


Trends in microsystem and 
communication technologies 
and markets. 


874 Gravenstein Hwy. So., Sebastopol, CA 95472
Phone: 800.527.0288, Fax: 707.823.0504, http://www.MDRonline.com 
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These offers good only until Nov. 30, 1997 


Microprocessor Report Subscriptions 


Choose either $50 off your subscription or a 25th Anniversary 
of the Microprocessor Poster (applies only to new subscribers) 



             U.S. & Canada   Europe   Elsewhere
One year     $595            £450     $695
Two years    $1095           £795     $1295


$50 off Poster 


Add local sales tax on subscriptions mailed to GA, KY, MA, TX, WA, or add GST in Canada 


Sub-Total 


Microprocessor Report Library on CD-ROM 

(Special rates are for new subscribers only) 


One year (4 quarterly updates) 


Single Issue 



3 


Technical Library 



395 


$125 


$345 

$345 

$125 

$125 


Sub-Total 


Orders for 
additional 
copies* must 
be within 30 
days of initial 
purchase, for 
use within the 
same company 
division. 


Beyond Conventional 3D: Talisman and Other Advanced Architectures 


Intel's Merced and IA-64: Technology and Market Forecast 


Battle for the Desktop: Strategies for x86 Microprocessors 


Selecting a High-Performance Embedded Microprocessor, 2nd Edition 


Intel Microprocessor Forecast, 2nd Edition 


New DRAM Technologies, 2nd Edition 


Buyer's Guide to DSP Processors, 3rd Edition 


DSP on General-Purpose Processors 



$895 

$945 

$250 


$895 

$945 

$250 

$1045 

$895 

$945 

$250 


$995 

$1045 

$295 

$1045 

$895 

$945 

$250 

$2695 

$1995 

$1995 

$295 

$2600 


$2500 





Sub-Total 



Microprocessor Forum 97 Materials 


4 



25th 

Anniversary 

Poster 


@ $30 


Sub-Total 


Conference Proceedings: 

including audio tapes and speaker handouts $295 

Seminar Workbooks: 

$95 each 

Microprocessors for PCs: 

A Critical Look at Technologies and Future Directions, by Michael Slater 

3D Graphics and Multimedia: Chips and Choices, by Peter N. Glaskowsky

Comparing High-Performance Microprocessor Designs, by Linley Gwennap

Evaluating Microprocessors for Embedded Applications, by Jim Turley



Less 10% if ordering more than one seminar workbook 


Sub-Total 





Product Total (Do not include Microprocessor Report Subscription) 


Product Sales Tax 

Add applicable sales tax for the following states: CA, CT, GA, IL, KY, MA, MN, NY, NC, TX, WA, or GST tax in Canada 

International orders please add $20 shipping plus $10 for each additional item 


Microprocessor Report Subscription Sub-Total


TOTAL 

Method of Payment 

□ Enclosed please find my check. 

(Payable to MicroDesign Resources in 
U.S. dollars, drawn on a U.S. bank) 

□ Bill my company. (Must attach a P.O.) 

□ Charge my: □ VISA □ MasterCard 

□ AmEx □ Diners Club 



Company 


Mail Stop 



Card # Exp. Date 

Signature (Required on all credit card orders)

M PF97F 

MICRODESIGN 874 Gravenstein Hwy. South, Sebastopol, CA 95472

Phone: 707.824.4001 or 800.527.0288 Fax: 707.823.0504 email: cs@mdr.zd.com Web: www.MDRonline.com 
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Business Action 


Microprocessor Report 


The one source you can count on for credible 
in-depth analysis, technical detail, and industry insight 

Microprocessor Report has been the leading technical publication for the microprocessor
industry since it was first introduced by Michael Slater in 1987. Published 17
times a year, Microprocessor Report is an independent, exclusively subscriber-based
publication focused on the constant advances in microprocessor technology.

Written by engineers for a technically astute audience, you’ll find: 

• New microprocessor announcements and reviews of new chips 

• In-depth analysis of microprocessor architectures and implementations 

• The microprocessor implications of emerging platforms 

• Emerging personal computer technologies, workstation designs, handheld system 
architectures, embedded processors, DSP technology, and intellectual property issues 
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National Semi Acquires Cyrix 

Cyrix Gains Licensed Fab; National Touts PC on a Chip 




by Linley Gwennap

This article is an expanded and updated version of the
news flash distributed with our last issue.

National Semiconductor has leapt into the PC processor
market by acquiring x86 vendor Cyrix for an estimated
$550 million in stock. The deal positions the combined company
as a stronger force in the PC processor market, bringing
Cyrix’s x86 designers together with National’s fab capacity.
National recently opened a 0.35-micron fab, competitive
with the best process technology that Cyrix is getting from its
current foundry, IBM. National also holds the essential Intel
patent cross-license agreement that will allow it to legally
manufacture and sell Intel-compatible processors.

From National’s standpoint, the deal allows it to extend
its line of PC-on-a-chip products, currently embodied by its
NS486 processor (see MPR 9/11/95, p. 1). The Cyrix processors
will provide more powerful cores for future highly integrated
products. National CEO Brian Halla envisions a
future in which PC-compatible ‘appliances’ sell for $500 or
even $200, with sales of these low-cost devices far outstripping
those of traditional PCs.

How these two strategies will play together in the long
term is unclear. The initial press announcement was somewhat
schizophrenic, with Cyrix stressing head-to-head competition
with Intel while National honed in on the PC-on-a-chip
concept. We expect the new company to pursue both


After Amelio left to take a [text illegible] in 1996,


Halla took charge at National and immediately gave the
green light for the Portland project to proceed at full speed.
He also expanded the size of the facility significantly, boosting
the total investment to more than $1 billion over three
years. The Portland fab went from shovel to sample silicon in
18 months, and National will begin shipping production
parts from the new fab next month.

National also vastly increased funding for IC process
development, hiring many experienced hands to accelerate
development of a 0.35-micron process. Halla wanted the
Portland fab to deploy the 0.35-micron process when it
opened, with a goal of reaching the 0.25-micron level in 1998
and 0.18-micron in 1999. After spending $70 million for
process development in 1996 and an expected $100 million
this year, National has met its initial goal; the first production
parts from the Portland fab are being built in a 0.35-micron
(effective) CMOS process.

National hopes to make a fast transition to the 0.25-micron
level. The initial process is designed such that 70% of
the equipment can also be used to build 0.25-micron parts;
this high degree of commonality should smooth the transition.
The company expects its 0.25-micron process to be
fully qualified in its Santa Clara (Calif.) development fab by
December, with volume production in Portland around the
middle of 1998, still a year behind the industry leaders.


mayd 


s lor an initial p 
oae to focus on one or the other. 


Plenty of Capacity for Cyrix Parts




Halla Revitalizes National's Fab Technology 

Under former CEO Gil Amelio, National’s fab technology
had languished; as recently as 1995, National's best process
was 0.8-micron (drawn), while many other chip vendors

Amelio’s cost-cutting measures left little funding for expensive
new fabs. In the fall of 1995, National did break ground on a
new fab in South Portland, Maine, but continued funding
problems slowed progress on the new facility.


Halla clearly has big plans for the Portland fab. By the end
of 1999, the fab will be built out to a capacity of 7,000 wafers
per week, roughly the size of AMD's Fab 25 or Intel's largest
fabs. This capacity will easily support a doubling or tripling
of Cyrix's market share in addition to National’s current
products. If necessary, the entire fab could be devoted to x86
chips; National's other products can be fabricated offshore
through an existing arrangement with foundry TSMC.

National also intends to maintain Cyrix’s existing relationship
with IBM, which currently manufactures chips for
Continued on page 6
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Microprocessor Report Library on CD-ROM 

Searchable archive of all Microprocessor Reports since 1992 

At the click of a mouse, subscribers to the Microprocessor Report Library on CD-ROM can 
access every article, IC announcement, editorial, chart or graph from Microprocessor Report 
since 1992. Five years of insight and analysis, updated in its entirety four times a year, are 
delivered in a conveniently accessible and searchable form, from the most respected team in 
the industry. Follow industry trends. With a single search, you can access a virtual timeline 
of progress in your interest areas. It’s powerful and easy to use. You’ll save time you might
otherwise spend by manually searching through back issues. Put together using Adobe Acrobat, 
your searches can be based on full text, article titles, authors, article types, or newsletter issue. 


Technical Library 


Beyond Conventional 3D: 

Talisman and Other 
Advanced Architectures 

Peter N. Glaskowsky, 

senior PC analyst, MDR 

Interest in 3D graphics is exploding, but conventional 
architectures are already starting to run out of gas. To 
continue the rapid pace of 3D performance advances 
(doubling performance every six months), radical new 
approaches will be needed as early as next year. Microsoft’s 
Talisman is one such approach, but others are offering 
competing designs that may win out. 

Beyond Conventional 3D first describes the basic technical 
concepts of the conventional PC 3D pipeline. It then offers 
an insightful exploration of the most promising opportunities
for enhanced quality and performance, and which
companies are likely to deliver on them.


The most knowledgeable decision makers in the microprocessor
industry are continually challenged by new technologies, conflicting
claims, and emerging trends. MDR’s Technical Library and
Special Reports help clarify choices by providing in-depth exploration
of key issues and the technologies driving the industry.








NEW RELEASE! 




In nearly 100 pages, you’ll find: 

• Objective metrics to compare 3D architectures including 
display quality, rendering speed, and chip complexity 

• An exploration of alternatives to the conventional PC 
pipeline, including Microsoft’s Talisman and HP PixelFlow 

• Specific techniques to improve display quality, such as 
increased resolution, color depth, and polygon precision; 
lighting and texture enhancements and more 

• An analysis of market conditions that will predispose 
some companies to succeed, others to collapse 









Intel’s Merced and IA-64: 
Technology and Market Forecast 

Linley Gwennap, publisher and editor in chief, 

Microprocessor Report 


NEW RELEASE! 


Even as competition for x86 processors
reaches a peak, Intel is readying its next generation of
microprocessors based on the architecture codeveloped with 
Hewlett-Packard, dubbed IA-64. Intel’s goals for Merced, the 
first IA-64 processor, are simple: to deliver the fastest processor 
in the world for high-end workstations and servers, providing 
a strong new competitor for the RISC vendors that dominate 
those markets today. Over time, IA-64 is likely to filter down 
into the PC market as well, ultimately displacing x86. 

Can IA-64 deliver on its promise? In the first detailed technical 
report on the subject, Intel expert Linley Gwennap looks at: 

• IA-64 design philosophy, including IC technology, 
system design, and rationale for a new ISA 

• IA-64 software model and its combination release 
of two instruction sets in one processor 

• Details of IA-64’s 64-bit extensions including instruction 
formats, branching behavior, and operation grouping 

• MDR’s forecast for Merced including clock speed, die size,
and performance

• MDR’s projected roadmap for Intel’s IA-64 processor family 


Battle for the Desktop: Strategies 
for x86 Microprocessors 


NEW 

RELEASE! 


Michael Slater, founder 
and editorial director, MDR 

The next twelve months promise one of the most competitive
and demanding markets for x86 processors, with
AMD, Cyrix/National, and IDT/Centaur all offering
viable alternatives to Intel. But in the battle for the desktop,
can these alternatives attain a sustainable, profitable
market share? This insightful analysis by Michael Slater
gives you detailed information on the products and 
strategies of each x86 vendor. It covers key issues such as 
the ability of Intel’s competitors to maintain Socket 7’s 
viability in the face of Intel’s move to Slot 1, and the role 
of instruction set extensions beyond MMX. 

Battle for the Desktop gives you a detailed review of the 
shifting PC environment with its changing user demands, 
unique market segments, and future trends. It then lays 
out Intel’s and each x86 competitor’s chips and processor 
roadmap against the market demands and the offerings of 
the other companies. For each x86 player, you’ll get: 

• Corporate histories and key statistics 

• Customers 

• Fab capacity 

• Legal strategy 

• Roadmaps 

• Detailed descriptions of the processors themselves 


Buyer’s Guide to DSP Processors, 
3rd Edition 

Berkeley Design Technology 


Fully updated and revised, this guide 
to digital-signal processors provides 
key insights into each processor’s 
strengths and weaknesses, as well as 
complete tables to directly compare 
processors for particular features or 
performance metrics. Written by 
DSP experts, Berkeley Design 
Technology, this report evaluates 
processor performance based on 
BDT’s own benchmarks. 

Processors evaluated include: 

• Texas Instruments TMS320C62xx 

• Motorola DSP563xx 

• Motorola DSP566xx 

• Motorola DSP568xx 

• Analog Devices ADSP-21cspxx 

• Plus 14 other processor families 




DSP on General-Purpose Processors 

Berkeley Design Technology 

In this unique report, Berkeley Design Technology analysts 
examine today’s general-purpose processors, 
evaluating how well they meet the needs of DSP 
applications. Processors are objectively evaluated 
based on performance, architecture, pitfalls, devel- 
opment tools, software, price, and packaging — 
critical characteristics to help you 
decide which is best for 
your application. 

Processors evaluated include: 

• Intel Pentium with MMX 

• IBM/Motorola PowerPC 604 

• Hitachi SH-DSP 

• Advanced RISC Machines ARM7 

• Integrated Device Technology R4650 

Intel Microprocessor Forecast, 
2nd Edition 

Linley Gwennap, publisher and editor in chief, 

Microprocessor Report 

Short of sitting in on an Intel microprocessor strategy meeting, 
you just cannot get a better look at Intel’s plans and potential. 

Whether you compete or 
collaborate with Intel, you 
need solid data on where 
the industry’s most pow- 
erful semiconductor firm 
is headed. In nearly 100 
pages of text, charts, and 
figures (all in full color), 
you’ll find the most 
detailed forecast available 
of Intel’s manufacturing 
capacity and costs, prod- 
uct roadmap, pricing, and 
unit shipments. 

Intel Microprocessor 
Forecast, 2nd Edition, 
provides the details that strategists, investors, and decision- 
makers need in order to plan for upcoming changes in the 
Intel product mix. You’ll find: 

• Schedule and performance estimates for Katmai, 
Willamette, Deschutes, and Merced (IA-64) 

• Manufacturing cost estimates for current and future Intel 
processors 

• Price forecasts through the end of 1998 

• Unit shipment forecasts through the end of 1999 





Selecting a High-Performance 
Embedded Microprocessor, 


2nd Edition 

Jim Turley, 
senior editor, 

Microprocessor 
Report 

Some of the fastest, most 
complex, and highest vol- 
ume chips have been 
designed for embedded 
applications. Which is 
right for yours? With more 
than 100 32-bit embedded 
microprocessors to choose 
from, it can take months to 
find the right one. That’s 
why the fully revised and updated second edition of Selecting 
a High-Performance Embedded Microprocessor is such a valu- 
able decision-making resource. 

This 450-page volume, filled with easy-to-use graphs, tables, 
and charts that compare performance, costs, and power con- 
sumption, is an indispensable time saver and guide. You’ll 
find: 

• Analysis and specifications for over 115 chips and 17 archi- 
tectures including reviews of the latest chips like picoJava, 
PowerPCs and more. Every chip is analyzed based on the 
metrics you’re most likely to need: cost, performance, 
power requirements, availability, and prospects 

• Processor selection guides which make it easy to find chips 
that meet your specifications 

• Independent analysis of each architecture from the most 
trusted source in the industry 


New DRAM Technologies: A 
Comprehensive Analysis of the New 
Architectures, 2nd Edition 

Steven Przybylski, Ph.D. 

This one-of-a-kind, comprehensive resource is designed to 
help you understand the DRAM choices and trends that are 
moving DRAM technology to new performance peaks. The 
second edition delivers an expert analysis of system-level 
implications, side-by-side architecture comparisons, and a 
historical perspective. 

You’ll find: 

• Current trend analysis 

• A new metric to compare DRAM architectures 

• A look beyond peak bandwidth to the more realistic sus- 
tained bandwidth in systems 

• Changes, directions, and opportunities in the 64-Mb gen- 
eration of DRAM and new challenges at the 256-Mb and 
1-Gb levels 

• Detailed discussion of the ongoing incompatibilities 
among SDRAMs 

• Analysis of the system-level impact of high-performance 
memories with and without secondary cache 

• Thorough tutorial material, including what DRAMs are, 
how they work, and how they are used together to form 
memory systems 

Commemorative Poster 
25th Anniversary of the Microprocessor 


A collaborative venture between MDR and The 
Computer Museum History Center, this full-color 
poster is prized by collectors as both a great reference 
and a handsome piece of art. 

• 24" x 36" (fits a standard size frame) 

• Die photos of over 130 processors, arranged by 
family and year of introduction, with each die 
shown in proportion to its actual size 

• Covers microprocessors from the Intel 4004 to the 
latest high-performance chips 

• Shows transistor count for each processor 

• Protectively UV coated and shipped in sturdy 
cardboard tubes 

Seminar Workbooks 

Buy more than one Seminar Workbook and 
receive a 10% discount. 

• Microprocessors for PCs: A Critical Look at Technologies 
and Future Directions, Michael Slater 

• 3D Graphics and Multimedia: Chips and Choices, 

Peter N. Glaskowsky 

• Comparing High-Performance Microprocessor Designs, 
Linley Gwennap 

• Evaluating Microprocessors for Embedded Applications, 
Jim Turley 

If the current price list and order form insert is missing, 
or if someone in your company would like to place a sep- 
arate order, please contact Customer Service at MDR and 
we’ll gladly get one to you. 

Visit our website for further details. Outlines and greater 
details on our Technical Library are posted as soon as 
they become available, often prior to publication. You can 
also order directly on the Web. 


Microprocessor Forum 97 Materials 

The full value you discover at MDR Forums — insightful, indepen- 
dent analysis, in-depth presentations, lively discussions, and an 
unmatched level of detail — can be appreciated and recollected eas- 
ily when you have audiotapes and speaker handouts available for 
your quick reference back in the office. Seminar Workbooks pro- 
vide valuable perspective on the emerging technologies addressed 
at Microprocessor Forum. 

Conference Proceedings 

Including audiotape and speaker handouts 

for all sessions 

• Keynote Speech: A New World Order — Alternative 
Microsoft Windows Platforms, Jerry Sanders, AMD 

• x86 Processors 

• The IA-64 64-Bit Instruction Set Architecture 

• Thoughts on Computer Architecture for the Future 

• High-Performance RISC Microprocessors 

• Microprocessors for Embedded Applications 

• Next-Generation DRAMs 

• Multimedia Acceleration 

• Panel: Opportunities for Future Microprocessors 


TO ORDER 




707.824.4001 or 800.527.0288 
707.823.0504 
cs@mdr.zd.com 